
Integrating with Linux package management systems #1214

Closed
nim-nim opened this issue Sep 23, 2017 · 8 comments

Comments

@nim-nim

nim-nim commented Sep 23, 2017

Hi,

I'm looking at integrating Go software in my Linux OS package management system.

The basic reason is that Go projects can and do use resources (data, apps, libs) not released as Go projects, and non-Go projects may wish to depend on Go software. A language-specific solution only works as long as one stays within the language ecosystem. And experience shows that subsuming everything else in a language-specific ecosystem is too much work.

In other words I want to jump to step 7 of ¹, while the Go community is working here on step 6.

The targeted system is rpm/dnf (used in Fedora / Red Hat / CentOS). rpm has a long history of integrating with multiple language-specific dependency tools, and what works for rpm should work in other Linux systems, since years of intense competition have made those systems converge on the same features.

To integrate cleanly, I need the following properties:

  1. standard deployment paths or filenames, so rpm detects it needs to run a language-specific dep solver
  2. a command that, when fed the detected paths, reads their content and outputs the identity of the software artifact (canonical name, version, and other attributes)
  3. a command that, when fed the detected paths, reads their content and outputs the ids and constraints of the artifacts needed to use them

Since Go projects have a standard structure in GOPATH, 1. is already satisfied.

I can simulate 2 since creating a package in rpm requires collecting most of the same metadata (though it would be nice to check whether the metadata declared by the packager agrees with what the software object says about itself).

I could probably write my own Gopkg.toml parser to do 3., but I understand the file format is not set in stone, and anyway it is much preferred to leave language-specific magic to language-specific tools.

As additional constraints, neither 2 nor 3 should change files or perform network calls. 2 because we really do not want to ship material altered from upstream in Fedora, and 3 because we want builds to be reproducible and auditable; allowing network state to change the result is a no-go. I understand that means it won't be possible to sort git commits, so for Go packages that declare a git commit as a dep, we'll have to either force this specific commit or allow any commit.

Therefore I'd like go dep to grow a pure dependency analysis mode with no downloads, no network access and no file writing.

If necessary I can run a preprocessor with file writing but no network access. However, some projects won't like that we install the preprocessed files in addition to their own, so a purely passive command is preferred.

A dependency in rpm is simply one of:

  • artifactid
  • artifactid = version
  • artifactid < version
  • artifactid >= version (and so on)

Assuming versions follow the usual sanity rules, i.e. segments compare numerically, so that 2.10 > 2.2 > 2.1.1, etc.
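
For illustration, a minimal sketch of that ordering (segments compared as numbers; real rpmvercmp also handles alphabetic segments, epochs and tildes, which I omit here):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vercmp returns -1, 0, or 1 as a is older than, equal to, or newer than b.
// Segments are compared as integers, so "10" > "2"; missing segments count as 0.
func vercmp(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		ai, bi := 0, 0
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		switch {
		case ai < bi:
			return -1
		case ai > bi:
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(vercmp("2.10", "2.2") > 0)  // true: numeric, not lexical, comparison
	fmt.Println(vercmp("2.1.1", "2.1") > 0) // true: extra non-zero segment wins
}
```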

More advanced syntax is available but 99% of software lives happily without it
(http://rpm.org/user_doc/boolean_dependencies.html, http://rpm.org/user_doc/dependencies.html)

Currently the packager declares dependencies manually in the golang namespace, so
github.com/matttproud/golang_protobuf_extensions/pbutil will provide the following

  • golang(github.com/matttproud/golang_protobuf_extensions/pbutil) = 1.0.0

and code that needs it will require one of

  • golang(github.com/matttproud/golang_protobuf_extensions/pbutil) or
  • golang(github.com/matttproud/golang_protobuf_extensions/pbutil) > 0.9 (to prohibit old versions) or
  • golang(github.com/matttproud/golang_protobuf_extensions/pbutil) = 1.0.0 (to use this specific version)

For the attributes that impact dependency selection but do not participate in the canonical version sort, I suppose we'll have to use slightly different tokens.

For branches:

  • golang(something)
  • golang(something)(branch=mybranch)
    with consumers choosing the token they want to match on

For commits

  • golang(something)
  • golang(something)(commit=mycommit)

For commits on specific branches

  • golang(something)
  • golang(something)(branch=mybranch)
  • golang(something)(commit=mycommit)

For freestyle version modifiers

  • golang(something)
  • golang(something)(vermod=alphax)

As an additional complexity, we're moving unit tests (*_test.go) into separate packages, so it'd be nice if there were a way for the dependency generator to emit only test deps, or only non-test deps.

Would it be possible for go dep to emit something like this? It does not need to match this syntax exactly; as long as there is one line per dep and the syntax is sane, I can rewrite it to rpm format in rpm routines.
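
To make the request concrete, here is a rough sketch of the kind of standalone translator I have in mind: it reads an existing Gopkg.lock (no network, no writes) and prints one rpm-style token per locked project. The struct fields mirror Gopkg.lock's [[projects]] entries; the TOML library choice and the emitted syntax are just illustrative:

```go
package main

import (
	"fmt"
	"log"

	"github.com/BurntSushi/toml"
)

// lockFile mirrors the [[projects]] entries of a Gopkg.lock.
type lockFile struct {
	Projects []struct {
		Name     string `toml:"name"`
		Branch   string `toml:"branch"`
		Version  string `toml:"version"`
		Revision string `toml:"revision"`
	} `toml:"projects"`
}

func main() {
	var lock lockFile
	if _, err := toml.DecodeFile("Gopkg.lock", &lock); err != nil {
		log.Fatal(err)
	}
	for _, p := range lock.Projects {
		switch {
		case p.Version != "": // released version selected by the solver
			fmt.Printf("golang(%s) = %s\n", p.Name, p.Version)
		case p.Branch != "": // branch-tracking dependency
			fmt.Printf("golang(%s)(branch=%s)\n", p.Name, p.Branch)
		default: // bare revision
			fmt.Printf("golang(%s)(commit=%s)\n", p.Name, p.Revision)
		}
	}
}
```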

¹
As far as I've seen, most languages go through the following phases as they mature:

  1. I don't need no version or dep management, my latest code is always perfect
  2. Rewriting the world is too much work, I need something to import others' code, the latest version will do
  3. @#* I'm sick of others committing broken code or breaking compat, I need something to import others' code and clamp it to a stable baseline that only needs testing once
  4. @#* My baseline is missing the latest fixes/features, my other_project uses about the same deps with more recent/older versions, I need something to import others' code, figure out safe upgrade paths from the latest baseline I tested, and update to a safe baseline I can share among my projects
  5. People roll competing systems, and major angst ensues when they try to share code
  6. The language ecosystem grows an official solution
    5.a or 6.a The language grows a shared library system. (Why rebuild the same code all the time once you have a system that agrees on which code version is the "good" one? Plus, shipping the same binary objects multiple times sucks for embedded systems and microservices.)
  7. Linux people map the official language solution in their system-level software management, so language apps can appear in the system app store, use other apps in the store and be used by other apps in the store

The whole process can take more or less time depending on the language community's values, and on vested interests battling any official solution so they can sell their own management addons, provide consulting to build/install badly integrated code, or profit from bad integration to reinvent wheels because accessing existing wheels is too hard.

@sdboyer
Member

sdboyer commented Sep 25, 2017

hi, welcome! whew, there's a lot here.

so, i think that the thing you're going for is probably feasible; your requirements sound similar to the goals of rust-lang/cargo#3815. however, your description here jumps from outlining more general goals to stating specific requirements that, i suspect, will make it harder to actually achieve those goals. in order to try to direct the conversation in what seems likely to be the most helpful way, i'm going to eschew discussion of details in favor of focusing on the big question:

Therefore I'd like go dep to grow a pure dependency analysis mode with no downloads, no network access and no file writing.

dep is built around a solver that is entirely abstracted from I/O - it works on an entirely in-memory, symbolic representation of the problem. this is by design, with the intent that alternative tools can be built.

however, the consistent operation of dep, and the various guarantees it provides, are predicated on having data that is only available via at least some kind of I/O, very often over the network (or at least originally retrieved from the network, but cached locally). this is encapsulated by the SourceManager - so it's possible to build a version of dep that cedes control for furnishing solver information to system-level mechanisms, but a) it'd be quite difficult and b) you'd very likely end up with a tool that isn't mutually compatible with dep. maybe that's a concern for you, maybe it isn't.

but, i also think that creating an entirely new SourceManager implementation is not the right path to take. we have stories, some complete and some underway, around the guarantees you want to provide:

because we really do not want to ship material altered from upstream in Fedora

there's active work underway to introduce a registry concept into the ecosystem: #1172 is the first PR. i suspect that the facilities we build into registries will provide these guarantees, as well as the curation controls (i.e., Fedora packaging folks get to decide which versions of which dependencies are available) that are implicit in your description.

i suspect that the easiest way to provide such a registry would be to create one that does on-the-fly translation from rpm repositories. once we have our registry pattern in place, i'd be fine with having that done entirely locally, with the "registry" just being on the other side of a socket, if that would be suitable.

because we want builds to be reproducible

builds are already strongly reproducible in dep, with facilities in place to guarantee it. the only thing we don't really defend well against today is a left-pad-like event - but in your context, that's unlikely to be a concern, as registries proxying to rpm repositories would obviate it.

and auditable

while you'd have to expand on what "auditable" entails for me to be certain, our work on verification (#121) is at least a strong step towards that. now, that model is still largely based on essentially p2p verification, which means that many of TUF's models aren't directly applicable to our situation. however, that basic hashing system ought to be helpful as we continue work on registries, which - being central - put us solidly in TUF's domain.

so...

Would it be possible for go dep to emit something like this?

dep itself is unlikely to emit such information directly, though a standalone tool could probably be built that does. but the path and constraints you propose are likely to be slower and more error prone; if you're amenable to different approaches that afford dep/a parallel tool somewhat more control, then i think it could be both more robust and easier to achieve.

@nim-nim
Author

nim-nim commented Sep 25, 2017

Thank you for the answer, it's difficult to find the right level of detail to share when people come from different universes. I'll try to simplify.

As I understand it, go dep does

  1. developer downloads the root project files or asks go dep to make artifact foo available
  2. go dep parses go dep-specific foo project metadata
  3. go dep computes dependencies and constraints, consulting online repo states (in its own solver)
  4. go dep downloads the best solution

A Linux os-level cross-language artifact management splits the steps in the following

  1. artifact maintainer downloads foo project files (and signs them, so from this step on malware can not inject data into the process, and the next steps are independent from network/cloud state)
  2. a command, hopefully as close as possible to the language community's own tooling, computes the artifact id and attributes (version and so on) from the project files, what is needed to ship the project files, and what is needed to use the project files. That step runs without network access (ideally, for security reasons, in a clean chroot or container with as little installed as possible)
  3. the artifact files and computed metadata are processed and made available in the system app store
    (with various robots auditing the app store state continuously)
  4. when someone else (not the maintainer or the project dev) requests the artifact, it is downloaded with its requirements, as computed by the system solver (in my case rpm). The system works from the constraints computed in 2. and saved in the artifact metadata.

That's a model that works with cargo, pip, maven and so on. There are only so many ways to do software engineering, so barring pathological cases (for example, a language that would consider that version 2 succeeds version 3.0.0) a common solver works well, as long as the language-specific constraints are properly translated to the system solver language.

I'd like to use go dep in step 2.

Thus

  1. I need to know whether it is better (more future-proof) to start from the metadata file or from a go dep command invocation to get a Go project's constraints
  2. I need to translate those constraints to the system solver language, so some sort of go dep grammar description would be welcome

@sdboyer
Member

sdboyer commented Sep 26, 2017

so the common theme here is going to be that "this is not easy, because dep does static analysis of code."

As i understand it, go deps does

yes, this is largely correct, with two amendments, one quite crucial. first, less crucial:

asks go dep to make artifact foo available

dep is not involved in the initial fetching of a project - what we typically refer to as the "current" or "root" project.

consulting online repos states

because we have to perform static analysis of the code itself, not just metadata files (Gopkg.toml, Gopkg.lock), we actually clone down upstream VCS repositories and work on local instances of them in a completely isolated area (GOPATH/pkg/dep/sources).
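
to make "static analysis" concrete, here's a toy sketch (standard library only) of the kind of import extraction involved - dep's real analysis (in the gps package) also walks whole project trees, handles build tags, ignored/required lists, and more:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"log"
	"os"
)

func main() {
	fset := token.NewFileSet()
	// parse only import declarations; os.Args[1] is the package directory
	pkgs, err := parser.ParseDir(fset, os.Args[1], nil, parser.ImportsOnly)
	if err != nil {
		log.Fatal(err)
	}
	seen := map[string]bool{}
	for _, pkg := range pkgs {
		for _, f := range pkg.Files {
			for _, imp := range f.Imports {
				p := imp.Path.Value // quoted import path, e.g. "\"github.com/foo/bar\""
				if !seen[p] {
					seen[p] = true
					fmt.Println(p)
				}
			}
		}
	}
}
```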

There are only so many ways to do software engineering, so barring pathological cases (for example, a language that would consider that version 2 succeeds version 3.0.0) a common solver works well, as long as the language-specific constraints are properly translated to the system solver language.

there are going to be some nontrivial problems in performing this translation. there's no really easy way that i can think of to break these down, so i'll just list them out in a way that's hopefully clear enough for the implications to be grokkable:

  1. we separate the notion of requirement ("a dependency MUST be present") from constraint ("IF a dependency is present, its version must be in some set of X"). on its own, this may not be that challenging for translation, but it is an important base fact that combines with others to create challenges.
  2. the unit of dependency is the project (which, in all current cases, is equivalent to a VCS repository). projects themselves declare one set of constraint rules, in Gopkg.toml. but projects are actually trees of packages, and it is crucial that we only pull in the imports of the packages we use from a dependency. that can, of course, happen as the solver incrementally encounters more imports from within a particular project.
  3. constraints are not considered to have transitive power (though this may change - Migrating a library Godeps.json with transitive deps #1124); this suddenly matters when "constraint" and "requirement" are separated, as it means that a constraint merely being present in a Gopkg.toml is not sufficient to guarantee that it should be respected/activated. so, even though Gopkg.toml is stable enough for you to parse (to your question), its meaning and correct interpretation is not so simple. more on these possibilities and absurdities in this gist.
  4. you could treat each individual package as a discrete dependency, but that blows up the graph a lot - and it is an invariant maintained by our solver that all packages from a given dependency project must be at the same precise revision.
  5. it is likely that we will add some basic static analysis and type checking as an additional satisfiability constraint in the future, with the express intent that it reduce the need for expressing the much coarser version constraints. this is potentially a very large attribute space.

further, if GOPATH is foundational to your strategy, as this seems to suggest:

Since Go projects have a standard structure in GOPATH, 1. is already satisfied.

then...well, it doesn't bode well. the single biggest pain point in dependency management for Go historically has been that entirely unrelated projects are forced to fight over versions of dependencies. that's why dep touches GOPATH (almost) not at all, and works entirely with vendor directories instead. now, dep could give rise to a versioned file space (note, that doc is not official), but that's a ways away. (though i know that Cargo, at least, has a versioned file space that it draws on for its dependencies, so i'm guessing you're not necessarily locked into "there can only be one installed version of X")

in essence, what you're looking for here is for dep to take its domain-specific SMT model and re-encode it in terms of rpm/dnf's more generic SMT for this domain, which is then interpreted by libsolv (at least, if dnf). i have to quote Knuth here:

“Thus the art of problem encoding turns out to be just as important as the art of devising algorithms for satisfiability.” (Volume 4, Fascicle 6, p. 97)

so...yeah, this path is fraught.

yes, we probably could develop a more formally specified grammar for the inputs to dep's solver. but that's not something on our immediate horizon, and it seems likely to me that crucial information would be lost - especially as dep evolves the model.

these difficulties are why i initially suggested that finding a way to afford dep at least a bit more control over the process would likely reduce your pain by a lot - for example, not trying to force dep into a local-only mode. of the goals you've cited thus far, the only one that dep doesn't have support for either today or in the immediate future is signed releases - and that's something we could probably roll in with the registry work.

i understand both the desire to and value of keeping dep in the same box that other such systems have historically fit into. and i get that system package managers are gonna have to make the decision on the basis of what fits most sustainably into their models. i'm just trying to highlight that there's a lot of pain to be had in swimming upstream in this way, and you might be able to satisfy your requirements more easily if you lean more heavily on the facilities dep already provides.

@nim-nim
Author

nim-nim commented Sep 27, 2017

Thank you for the nice answer. It makes me cautiously optimistic, as I recognize stuff rpm already knows how to deal with. Even if it's not intentional on go dep's part, software engineering constraints force most languages to adopt similar choices as soon as they acknowledge those constraints (Java, for example, took awfully long to admit some of those).

dep is not involved in the initial fetching of a project - what we typically refer to as the
"current" or "root" project. because we have to perform static analysis of the code itself,
not just metadata files (Gopkg.toml, Gopkg.lock), we actually clone down upstream VCS
repositories and work on local instances of them in a completely isolated area (GOPATH/pkg/dep/sources).

This is actually pretty much the model expected by rpm:

  1. artifact manager downloads the sources of its artifact (what you call the root project)

  2. rpm invokes artifact-specific commands to prepare the files the artifact will make available to other artifacts, and installs them in an isolated area, i.e.:

  • for languages that use shared libraries, some form of make then make install
  • for languages like Go, some form of make install (right now we script this part, but some form of standardised go install command would be nice)

Typically the install part targets $buildroot/final_path; then rpm isolates $buildroot so the language-specific solver commands only see a virgin minimal system + the artifact files in /final_path + the files of any artifact that the artifact manager declares necessary in this "build" phase (build as in create an rpm that can be used by other rpms, not necessarily build as in compile code).

  3. rpm invokes a language-specific solver to analyse the files in final_path and compute their requirements. That is what I wanted to use go dep for, and you've just confirmed that is also how go dep expects to work, and that just reading some metadata files won't work well.

Right now we use /usr/share/gocode/src/project_tree as final_path; it would be trivial to change it to /usr/share/gocode/pkg/dep/sources/project_tree, as long as pkg is a generic root, not the project name. You can think of an rpm system as a giant shared virtual vendor tree. Only the necessary parts of the tree are installed (and installing the whole tree is not possible since some parts can conflict with others), users of the system can request the parts they need at build, run or test time, and tree parts are not mutable (when you request a part you always get the same part). So you can not have dep(change X property from Y to Z at install time). But the virtual vendor tree can perfectly well contain both dep-withX=Y and dep-withX=Z, so the non-mutability is actually less constraining than one may think.

Where I suspect rpm and go dep designs differ is that rpm is fully recursive, while go dep wants to perform a holistic analysis of the root project and all its deps in one phase (but I may be mistaken about go dep). In the rpm model an artifact is only analysed at creation time, and the constraints of artifact foo = intersection of (constraints of foo's first-level deps, constraints of foo's second-level deps… and so on). Would it be so difficult for go dep to only compute first-level deps, and rely on the computation it did before for next-level dependencies?

we separate the notion of requirement ("a dependency MUST be present") from
constraint ("IF a dependency is present, its version must be in some set of X").

rpm understands all of "a dependency must be present", "a dependency must be present in set X", and "conflicts with this dependency in this set of Y" (usually sets are either a specific version, more than version X, less than version Y), and a lot more

Now, distribution policy may be to forbid shipping artifacts with some forms of convoluted constraints rpm is technically able to express, because in the distributions' experience, while those constraints are technically possible, they are bad software engineering and end up causing problems some years later. But let's focus on technical translation here.

though i know that Cargo, at least, has a versioned file space that it draws on for its
dependencies, so i'm guessing you're not necessarily locked into "there can only be one
installed version of X"

As you've discovered, choosing which versions to make available is the bone of contention when creating a solver-centered system. This has been debated from the creation of Linux distributions to this day.

It is not possible to have a perfectly coordinated software universe where only a single version of a given component exists at one time (except for very restricted and shallow universes). Therefore, single-version systems are utopia. However, while utopia is not realistic in our imperfect universe, it can be a very useful target to pursue.

In fact, there are huge (hidden, and deferred) costs in shipping every possible version of a component that may have been selected by the other components in the software universe. Yes, in free software forking is cheap (staying stuck on one specific version and ignoring later changes is a form of forking, since the project authors move on). It is also a way to accumulate technical debt. You can not sustain forward momentum for long if there is no technical or social mechanism to force convergence on a limited set of versions for a given component, and make this set progress with time. Ignoring this point bankrupted Sun Java to the point that Oracle is still struggling to relaunch the ecosystem.

None of the people advocating "each component is allowed to lock all its deps at specific versions" ever managed to build a sustainable system, even though it is trivial to create technically (just use a versioned and partitioned tree). They start strong, but the drag of all past versions, which need shipping, checking and fixing in addition to the ones required by newer components, quickly fossilizes them. No software project is ever finished; there are always new security alerts, new killer features, etc. that require reviewing past code, if only to check whether it is affected. Besides, they tend to accumulate conflicting requirements (though that may be solvable with hard partitioning). And they quickly suffer from a terrible image, to the point that software consumers start to actively avoid them. One only needs to be affected by so many problems, which are fixed in foo project's bardep latest version, except foo project version-locked bardep so long ago it is unable and unwilling to port to the fix, to conclude you do not want to touch foo project, nor any other project that seems to use the same software engineering rules. Linux distributions sometimes move too slowly for people's taste, but they do move on. And forward velocity is just a compromise between the level of QA and the amount of effort in a distribution's case.

In fact some software projects are now asking distributions to lock whole runtimes (sets of dependencies with specific constraints and APIs) to force their ecosystem to converge on those runtimes, since they found out a single component version per system was still too dynamic to optimize maintenance.

Therefore, even though rpm systems can and do allow multiple versions of the same component, since they need to work in the actual world, policy will usually be to treat everything but the latest packaged version (or major version) as a temporary exception that needs to be phased out mid-term. Even though that annoys big-time software developers who have chosen a different upgrade cadence, it is usually possible to come to a compromise as long as both parties make an effort.

I agree this can be a very unpleasant phase, no party looks forward to compromising and taking other people's needs into account, and people are often passionate distribution and project-side, but it is the cost of sustainability and forward progress. I just wish more people understood this.

From a purely technical point of view, to make multiple versions of the same Go component work in rpm, I would need:

  1. either, assurance that in the go dep universe, building a Go project will never involve more than one version of the same component. As I wrote before, all parts of the rpm vendor tree do not need to be installed at the same time, so it's perfectly fine to have several parts that use the same filenames and paths
  2. or, if there are scenarios where go dep will need to use several versions of the same component simultaneously, some form of convention on where to install those versions so the filenames do not conflict. That convention needs to be invariant (I can not make the path of an artifact depend on how the solver arrived at that artifact). It can be as simple as installing the foo Go project in root/foo/major-version/exact-version, with a "latest" symlink in root/foo/major-version/ (if Go wants to allow commit hashes as exact-version), as long as Go then knows to look in root/foo/major-version/exact-version when go dep computes it needs an exact version, and to fall back to root/foo/major-version/latest if it is not available.
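
A sketch of that lookup convention, with purely illustrative paths:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// locate returns the directory holding the requested version of a project,
// falling back to the major-version "latest" symlink when the exact version
// is absent. Layout: root/project/major-version/exact-version.
func locate(root, project, major, exact string) (string, error) {
	exactDir := filepath.Join(root, project, major, exact)
	if _, err := os.Stat(exactDir); err == nil {
		return exactDir, nil
	}
	latest := filepath.Join(root, project, major, "latest")
	if _, err := os.Stat(latest); err == nil {
		return latest, nil
	}
	return "", fmt.Errorf("%s %s not available under %s", project, exact, root)
}

func main() {
	dir, err := locate("/usr/share/gocode", "github.com/foo/bar", "v1", "v1.2.3")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", dir)
}
```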

The alternative is to have a form of component registry that go dep consults to learn where artifact foo version bar is on the filesystem. Several software ecosystems work like this. However, since one still needs to install the files somewhere, someone will still need to come up with an installation convention, and in my experience reading and writing to the registry usually turns out to be a lot more complex and brittle than agreeing on where the files should be in the first place.

“Thus the art of problem encoding turns out to be just as important as the art of devising algorithms for satisfiability.”

Yes, right, I don't worry about the rpm solver engine, it has proved itself time and time again (and is periodically rewritten for features and performance); the difficulty is to encode the constraints, and to choose sane constraints. You can encode perfectly bad software engineering; it stays bad software engineering.

Choosing sane constraints is a discussion we need to have with each individual project.

Encoding the constrains is the discussion we are having here. I don't really need a

more formally specified grammar for the inputs to dep's solver

I need the dep solver to output the result of its computation somewhere (command output or result file) so I can scrape it and translate it to rpm speak. I need to understand the output format so I make the best possible translation for all parties, and I need some way to be alerted when there are additions or changes to the output format.

While I do agree that in theory there are so many things that can go wrong that such a translation is frightening to contemplate, in my experience projects that do not come from a solver culture start by over-specifying their needs ("defensive" constraint programming), before settling on a "good enough" command subset which is more or less always the same and not that difficult to translate.

I understand both the desire to and value of keeping dep in the same box that other such
systems have historically fit into. and i get that system package managers are gonna
have to make the decision on the basis of what fits most sustainably into their models.
I'm just trying to highlight that there's a lot of pain to be had in swimming upstream in
this way, and you might be able to satisfy your requirements more easily if you lean
more heavily on the facilities dep already provides.

I do want to rely on go dep as much as possible :)

However, I need to integrate go dep in the rest of the system, because go dep will only ever manage the go part of a software project, and I'm already seeing (and needing a way to deploy) software projects that cross the go language barrier.

If go dep were better than the existing box, and able to manage everything in the existing box (i.e. not only the Go part), I might mourn the old tools a little, but that would not stop me from switching.

@sdboyer
Member

sdboyer commented Oct 1, 2017

Even if it's not intentional on go dep's part, software engineering constraints force most languages to adopt similar choices as soon as they acknowledge those constraints (Java, for example, took awfully long to admit some of those).

i'm a strong believer that each language's package management problems are all generally quite similar, and are best thought of that way; i wrote a lengthy article that rests on that premise.

at the same time, researching and writing that article, and the subsequent time spent working on this project, have convinced me that there are some system properties that do actually matter quite a bit when it comes to design choices in this space - and the ones that end up mattering can be counterintuitive. so i'm generally wary of dismissing language-specific concerns as snowflaking until i'm sure i understand the concern.

some form of standardised go install command would be nice

like...go install/go build?

it would be trivial to change it to /usr/share/gocode/pkg/dep/sources/project_tree, as long as pkg is a generic root, not the project name

while we could potentially design some hacks around this, it's not possible for dep to operate from an arbitrary disk location right now 😢. dep requires placement within a GOPATH/src in order to infer the root import path of the project that it's operating on, so that it can differentiate, at a purely symbolic level, between project-internal imports and ones that point to other projects.

one of the gists i linked earlier describes a possible path towards dep's operation at arbitrary locations.

rpm understands all of "a dependency must be present", "a dependency must be present in set X", and "conflicts with this dependency in this set of Y" (usually sets are either a specific version, more than version X, less than version Y), and a lot more

yes, these are basic examples of what rpm can express. but they don't appear to cover the case i'm concerned about: "if a dependency is required, then it must be in set X". also note that this is only really a problem when combined with the fact that the unit of dependency, the repo/project, has only one Gopkg.toml, but is actually composed of discrete packages, each of which has its own imports and can be discretely imported. again, please see this gist for details.

i realize this may sound similar to optional dependencies, but it really isn't, in large part because of the real difference that we need to talk about:

It is not possible to have a perfectly coordinated software universe where only a single version of a given component exists at one time (except for very restricted and shallow universes). Therefore, single-version systems are utopia. However, while utopia is not realistic in our imperfect universe, it can be a very useful target to pursue.

"installed" - or more precisely, "code is present on local disk" - means something very different in an rpm world than it does for dep, or for most LPMs. for SPMs, there is generally one global "installed" space - as you described it, the subset of the virtual vendor tree that happens to be present on a system - and a large part of the job of rolling a release of a distro is checking that the combinatorial sets of versions are largely mutually sane, such that any individual subset of that universe placed on any particular machine will also be sane. the final expressive form this takes is the code that is actually, literally, on disk.

dep/LPMs, on the other hand, generally avoid global, unversioned pools for their packages, as the scope of their operation is restricted only to a single project at a time. we don't care terribly much about what code happens to be on disk, as code happening to be on disk/installed is not an endorsement of it. that's up to the compiler or interpreter, which the LPM somehow instructs to pick a few particular versions out of the multitudes that may be "installed."

i tend to view this as a reflection of Conway's Law: for LPMs, responsibility for sanity is scoped to a project, and sits in the hands of the developer(s) responsible for that project. it's generally not possible to take on wider responsibility than that, by definition, as the set of packages (and people producing them) in LPMs is collectively uncurated (in Go, it's not even knowable). but for SPMs, where the goal is combinatorial, system-level sanity, the set of packages & people responsible for composing the known-good combinatorial set of them is generally knowable, as that combinatorial set is itself an artifact to be agreed upon and distributed (as the rpm tree).

so, if you're trying to create a world where all versions of Go software (installed via rpm) on a given machine have to agree on all the versions of their dependencies to use, then - reiterating my last response - you're going to be swimming upstream. it might, might, be acceptable for a world where the user isn't a Go developer, and/or where they aren't installing that much Go software. but this requirement is what GOPATH's structure originally imposed on the Go community, causing great pain, and was the original impetus for the creation of tools like dep.

assurance that in the go dep universe, building a go project will never involve more than one version of the same component

this is a foundational assumption of dep. in fact, barring some weird exceptions that are now possible (but dep disallows), it's a foundational requirement of the go compiler. though, the gist i linked describes one possible form of a planned future where that may no longer be the case.

to be clear, if you're trying to exert control at this level, you could find yourself needing to perform absurdly low-level invocations of go's compilation toolchain, rather than relying on the higher-level commands. again, there be dragons.

The alternative is to have a form of component registry, that go dep consults to learn where artifact foo version bar is on the filesystem.

both of my comments so far suggested that a registry would be an easier path.

also, to be clear: you basically can't control dep's behavior by messing with things on the local filesystem. it does its work almost exclusively based on what it finds from
the network (which you would be able to control via a registry). we are looking to add more local control, but it still wouldn't make the local filesystem a good way to achieve your goals.

I need the dep solver to output the result of its computation somewhere (command output or result file) so I can scrape it and translate it to rpm speak

the result of dep's solver's computation is the Gopkg.lock file. the more i read, the more i think you may be better served just looking at that.

Where I suspect rpm and go dep design differ is that rpm is fully recursive, while go dep wants to perform a holistic analysis of the root project and all its deps in one phase

i do not have a precise understanding of what you're trying to express by "fully recursive," "holistic analysis," or even "one phase" here - only the general senses of the words. to the user's eyes, dep has one phase. internally, the algorithm is similar to any other constraint solving algorithm - it has many phases and many moving parts, and exits only when it either finds a solution, or determines that there is no solution.

Would it be so difficult for go dep to only compute first-level deps

you may be able to extract what you're looking for from what we hope to make of dep status, or possibly from dep hash-inputs today. but...

, and rely on the computation it did before for next-level dependencies?

no. the entire reason the underlying problem here is difficult is that graphs do not have a structure that's predictable a priori - you can't effectively partition them into "levels." say we do a full pass through a project's direct dependencies, selecting versions for each. it's entirely possible that, three 'levels' deep, something could point back to one of the original direct dependencies from the "first level" with constraints that force us to select a different version. some of the "first level" work now must be revisited, which effectively nullifies the meaningfulness of the levels.


while i appreciate the considerable effort you're putting in with these questions and descriptions, let's please leave out the history lessons, assertions about certain tradeoffs being "bad software engineering," and opining about dep's place in your evolutionary model, etc. it smacks of the same sort of condescension with which SPM folks have sometimes treated LPM folks (i.e., like we're irresponsible amateurs) in the past. those conversations have generally not had productive results, and i'd like this to remain productive 😄

@nim-nim
Author

nim-nim commented Oct 3, 2017

at the same time, researching and writing that article, and the subsequent time spent working on this project, have convinced me that there are some system properties that do actually matter quite a bit when it comes to design choices in this space - and the ones that end up mattering can be counterintuitive.

I wouldn't put it quite this way; the designs are usually quite similar, what is awfully tricky is the grammar. Everyone reuses the same concepts, but the terms often map to slightly different boundaries. That's why I opened the issue: to avoid making bad guesses about the canonical source of Go dependencies and the meaning of those dependencies.

some form of standardised go install command would be nice

like...go install/go build?

Yes, I need to investigate more why my predecessors chose to reimplement the install command and not use the Go one (for build that's clearly because of GOPATH/src). We use native ecosystem commands in other languages just fine; distro workarounds, while handy, are a source of long-term problems and conflicts. Probably a combination of:

  • need to filter out unit tests (we do want to run them at rpm creation time to make sure the package is sane, just not ship them for other packages: it's a waste of space and they drag in an awful lot of dependencies). BTW: Go is an awful mess WRT test data and examples, I stopped counting the variations on directory structure and naming.
  • complex install structure (rpm checks that the files and directories created in the isolated root match the package manifest, to catch inadvertent changes between versions or upstream mistakes when creating release archives; writing a manifest is trivial when the packaging result is a bunch of binary .so files, but not when you have a multilevel tree with .go files sprinkled everywhere. So one thing our install macro does is register created files and directories in a file list that is then fed to the check; see the sketch after this list). If the Go ecosystem does not adopt shared libraries, it would be nice if it were possible to just install an archive of the source files in a standard filesystem place, instead of shipping expanded source trees with lots of small files to track.
  • lack of support for install location switches (typically: install for use in somedir, but prepend the buildroot that rpm will isolate in the next stage)
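
Here is a sketch of the file-list registration mentioned in the second bullet: walk the buildroot and emit one line per created file or directory, in the form rpm's %files -f mechanism consumes (paths and invocation are illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	buildroot := os.Args[1] // e.g. $RPM_BUILD_ROOT
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	err := filepath.Walk(buildroot, func(path string, info os.FileInfo, err error) error {
		if err != nil || path == buildroot {
			return err
		}
		// the installed path is the on-disk path minus the buildroot prefix
		rel := strings.TrimPrefix(path, buildroot)
		if info.IsDir() {
			fmt.Fprintf(out, "%%dir %s\n", rel)
		} else {
			fmt.Fprintln(out, rel)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```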

Nothing earth shattering, just more work to do our side and Go side so things work well together. And possibly some of this already exists Go-side, I'm just not aware of it. Needs investigating, so many things to do, so little time…

while we could potentially design some hacks around this, it's not possible for dep to operate from an arbitrary disk location right now 😢. dep requires placement within a GOPATH/src in order to infer the root import path of the project that it's operating on, so that it can differentiate, at a purely symbolic level, between project-internal imports and ones that point to other projects.

For what it's worth, it's a major PITA. I had to write a macro that creates a fake _build/src/import_path, with the last leaf symlinked to the current directory, adds the result to GOPATH, and cds to the leaf (the last step for an evil unit test that was not content with the files being in GOPATH with the correct directory structure, but wanted $PWD to match). It works, but it's very ugly and it needs to be invoked pretty much before every Go command.
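
The equivalent of the macro, sketched in Go for illustration (the import path is a placeholder):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	importPath := "github.com/foo/bar" // the project's canonical import path
	cwd, err := os.Getwd()
	if err != nil {
		log.Fatal(err)
	}
	leaf := filepath.Join(cwd, "_build", "src", importPath)
	if err := os.MkdirAll(filepath.Dir(leaf), 0755); err != nil {
		log.Fatal(err)
	}
	// symlink the last leaf back to the real working tree
	if err := os.Symlink(cwd, leaf); err != nil && !os.IsExist(err) {
		log.Fatal(err)
	}
	// callers then export GOPATH=$PWD/_build and cd into the leaf
	log.Println("GOPATH =", filepath.Join(cwd, "_build"))
}
```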

Would it be possible to just add a switch to Go commands to explicitly set the import path instead of inferring it from the directory structure? The first thing our automation does for Go projects is define an import path variable, for use by macros and other packaging helpers; it's quite annoying to be unable to pass it to native Go commands.

Alternatively, as I wrote before, it would be nice to have the project id (import_path) be part of the project metadata, in addition to project dependencies, so everyone can source it from the same canonical place.

one of the gists i linked earlier describes a possible path towards dep's operation at arbitrary locations.

I mostly like its contents with the following caveats:

  1. IMHO you over-emphasize the need for a registry. You will need to define where to materialize files in the filesystem anyway (at least defaults, because at some point the files will be materialized locally to be useful), and once you've defined "by default, import path foo version bar is installed in someroot/foo/bar" or similar, it's much easier to just pass someroot to Go commands rather than force them to read/write a registry that will just say things are in the usual place. You don't need the registry indirection and the associated management costs (please tell me if I'm missing something). All the language registries I remember off-hand end up registering that stuff is in its usual place, because once the novelty factor dies down people have other things to do than change the defaults, and automation will just apply the same rules all the time. Being able to define a custom installation root (or hierarchy of roots: system + project) is sufficient flexibility; one does not really need custom installation structures if the default one is well thought out. If you fear the initial structure won't stand the test of time because your understanding of the problem space is limited at this stage, just make provision for future structure revisions by versioning or id-ing them (and store the version/id in a file in the structure root).

  2. it is usually sane to define very clear fallback rules at the time you define locations for explicit versions (for example: if a project does not request an explicit dependency version or commit, just use whatever version is locally available in foo place or bar registry entry; or even allow a mode where constraints are relaxed and the developer can test his stuff with whatever version is available, even if it does not perfectly match his definitions). It is very easy to lock oneself in a maze of explicit version requirements that are inconvenient to refresh over time, or when you start sharing code with another project that uses the same dependencies (but not the exact same versions). A lot of projects that start with carefully chosen and tested dependencies just switch them to whatever's available over time, because it usually works, any version will need testing, and they don't have the time to check the history of all the stuff they depend on. So they really need a mode like "use whatever is available here, I already tested in other_project that this stuff works well together, I don't want to refresh the exact versions to use every time other_project QA forces a change".

rpm understands all of "a dependency must be present", "a dependency must be present in set X", and "conflicts with this dependency in this set of Y" (usually sets are either a specific version, more than version X, less than version Y), and a lot more
yes, these are basic examples of what rpm can express. but they don't appear to cover the case i'm concerned about: "if a dependency is required, then it must be in set X".

Actually "conflicts with dependency X in set of Y" has the same rôle in rpm land. It translates to "if the solver needs to make available X (and the only reason it may want to make available X is because it is required by something, it can choose any version but the set in Y". Basically rpm wants projects to be optimistic about what they accept, forbid what they actually know won't work, instead of whitelisting a small set of tested versions. The reason is that in the rpm distributions experience whitelisting results in overly strict and rigid requirements, that make it awfully hard to share code or update components, resulting in deadlocks or even crisis when a security alert forces quick updating of a project (our solver people hate explicit version deps). I agree the rpm variant is unintuitive, that left alone projects will ask for the one you chose (and not realise they are deadlocking themselves before it is too late).

So there is no problem translating to rpm, as long as there is a clear way to invert the set you want into a forbidden set (I would obviously prefer to have go dep perform the inversion, since it understands Go dependencies way better than I ever will; otherwise I can code an inversion macro rpm-side if you point me to the kinds of sets that would need inverting).

(The alternative is to force installation of the dep in the defined version set, which is ugly and wasteful but trivial to do.)
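
For concreteness, the kind of trivial inversion I mean, assuming an allowed interval with an inclusive lower and exclusive upper bound (the emitted syntax is illustrative):

```go
package main

import "fmt"

// invert emits the conflicts that complement the allowed interval [min, max):
// everything below min and everything at or above max is forbidden.
func invert(artifact, min, max string) []string {
	return []string{
		fmt.Sprintf("Conflicts: golang(%s) < %s", artifact, min),
		fmt.Sprintf("Conflicts: golang(%s) >= %s", artifact, max),
	}
}

func main() {
	// a caret-style range like ^1.2.0 allows [1.2.0, 2.0.0)
	for _, line := range invert("github.com/foo/bar", "1.2.0", "2.0.0") {
		fmt.Println(line)
	}
}
```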

also note that this is only really a problem when combined with the fact that the unit of dependency, the repo/project, has only one Gopkg.toml, but is actually composed of discrete packages, each of which has its own imports and can be discretely imported.

That is not a problem; there is no identity between source project and unit of dependency in rpm land. I can easily tell rpm either to split the source project into separate rpm files (one unit of dependency per file), or to have a single rpm file that provides all the units of dependency, or a mix of both¹. My limitations are that:

  1. all the units of dependency that end up in the same rpm file have to share the same constraints.
  2. every split is one more rpm file to track and manage, so I'm not interested in one-rpm-package-per-go-file scenarios (to caricature).
  3. I really want to avoid cases where different rpm files provide the same unit of dependency. It is sometimes unavoidable (projects which are drop-in replacements for other projects…) but gets quickly too tricky to manage sanely.

And I'd rather use the same splitting strategy all the time than revisit the rules for every go project.

"installed" - or more precisely, "code is present on local disk" - means something very different in an rpm world than it does for dep, or for most LPMs…
so, if you're trying to create a world where all versions of Go software (installed via rpm) on a given machine have to agree on all the versions of their dependencies to use, then - reiterating my last response - you're going to be swimming upstream. it might, might, be acceptable for a world where the user isn't a Go developer, and/or where they aren't installing that much Go software.

Well, that's precisely the point. An rpm system is first and foremost user-oriented, not developer-oriented. It is optimized to consume software, not develop it. It is designed so users do not have to worry about all the special rules that matter when developing each of the tens of thousands of components a user may consume daily as part of their IT usage. And that's perfectly fine like this. You need systems that optimize working on a specific component in a specific language (the various language-specific and developer-oriented systems like go dep), and you need systems (like dnf/rpm) to distribute the result as part of a larger whole as soon as your target audience becomes larger than the original developer set. To maximize success, the handover from developer-oriented to user-oriented system must be smooth. Otherwise, at the first problem, each stakeholder project will insist on users downloading its own binaries, made with its own rules, from its own site, which basically does not integrate nor scale, even though it is the most convenient for the original developers (as long as they are not cast in the user role).

That being said, you've missed a very important nuance of modern Linux systems. Except for specific distributions such as Gentoo, nowadays Linux packages are not built in the target run environment. They are built in isolated, secure, minimal environments, to limit unwanted side-effects, make sure the result can be easily reproduced, and for security reasons. That's where the "no network calls at project build stage" comes from. That also means it is not too difficult to have one component version exist when building A, and another when building B. Technically, rpm distinguishes between build dependencies (stuff that needs to exist when creating an rpm file; typically, for a Go project that does not use shared libraries, the source files of all the projects it depends on) and runtime dependencies (stuff that needs to exist when using the software, whose version needs to be agreed on by all the things running at the same time, or bad things happen).

So an rpm environment can accommodate different build-time dependencies for different projects (actually, runtime too, as long as the file paths are different and software is able to distinguish between the runtime versions to choose the one it needs). It just needs unambiguous versioning of the build constraints to know what version to make available at build time. Therefore it is very possible to translate a dependency system where versions vary from project to project to the rpm solver.

However, and practically, there will be a demand from distributors for projects to converge on a limited set of versions of the same component. Every new version you inject into the system is another version to check in case of problems (security or user-reported). So while in theory infinite variability is possible, in practice no one has the means to handle it. The post mortem of any serious security incident nowadays is almost always "the fix existed, but it was not identified or applied everywhere", usually because vendoring variability made identifying and propagating the fix overly difficult.

I know developers hate this kind of demand. It's more work on their side. It's insurance for problems that may never materialize. I also know of no magic solution. Pretending no problem will ever be identified in a bit of code in the future is not realistic. Pretending that projects that could not agree on common dependency versions will suddenly switch to the same ones on short notice in crisis mode is not realistic either. It's a lot easier for everyone involved if your starting point is a limited version set (limited, not necessarily single-version). That's where the "everyone uses his own git commit" model falls over and you start needing releases and versioning work, with a limited set of major versions and enough backwards compatibility inside a major version that every user can be easily upgraded to the major tip with the security fixes. That's basically what Google is attempting for Android with Treble.

assurance that in the go dep universe, building a go project will never involve more than one version of the same component

this is a foundational assumption of dep. in fact, barring some weird exceptions that are now possible (but dep disallows), it's a foundational requirement of the go compiler. though, the gist i linked describes one possible form of a planned future where that may no longer be the case.

I hope this future does not materialise. Java permits such a thing through classpath hierarchies, and it's an endless source of subtle and difficult-to-diagnose problems. The convenience of not having to synchronise on a common version while building a single binary is not worth it. As long as most Go projects don't use this, we'll ignore the others.

also, to be clear: you basically can't control dep's behavior by messing with things on the local filesystem. it does its work almost exclusively based on what it finds from
the network (which you would be able to control via a registry). we are looking to add more local control, but it still wouldn't make the local filesystem a good way to achieve your goals.

Good way or not, I don't have a choice: no network at build time is a hard rpm requirement, I can't change it. I can teach rpm to read a Go project's build dependencies ("when something tries to build against this project, it will also need X, Y and Z with those constraints"), but I can't let anything inside the project reach anywhere outside the restricted build environment, which will only contain the files corresponding to the project, X, Y and Z, and the Go compiler. That's how rpm still manages to build software originally released on long-dead Usenet groups or mailing lists (some parts of a Linux system are that old). GitHub can burn tomorrow; all the projects it hosts will still build fine when packaged in an rpm distribution (same thing for the Red Hat parts, BTW).

I need the dep solver to output the result of its computation somewhere (command output or result file) so I can scrape it and translate it to rpm speak

the result of dep's solver's computation is the Gopkg.lock file. the more i read, the more i think you may be better served just looking at that.

Ok, I'll focus on this then. Thank you for the fine advice! Is this file's syntax described somewhere?

Where I suspect rpm and go dep designs differ is that rpm is fully recursive, while go dep wants to perform a holistic analysis of the root project and all its deps in one phase

i do not have a precise understanding of what you're trying to express by "fully recursive," "holistic analysis," or even "one phase" here - only the general senses of the words.

Sorry if I wasn't clear. What it means basically is that rpm only computes the requirements of any given unit of dependency once. You can not reinterpret them in the light of the declarations of different root projects.

So a unit of dependency can state:
"If X exists it must not be in set Y"
(or, in go dep grammar, "If X exists it must be in set not-Y")

And X may be pulled in or not by the root project or another unit; the constraint will still work.

But a project that needs the unit of dependency can not state
"ignore constraint X of the unit of dependency, I want something else"

The final constraints are the intersection of the constraints of all the units involved in building a project; a unit controls the constraints it injects into the rpm solver, but not what the other units injected.

I hope that's clearer and compatible with go dep; if not, I can try to explain it some other way. I did not intend to mean that once something deep in the dependency tree puts a constraint on a unit of dependency, no other unit nor the root project can put another constraint on the same unit. That's perfectly fine as long as there is a way for the solver to satisfy both things at once².
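
As a toy illustration of that intersection rule (versions simplified to integers, intervals half-open):

```go
package main

import "fmt"

type interval struct{ lo, hi int } // allowed range for X: [lo, hi)

// intersect narrows the allowed range with every unit's constraint; an
// empty result means the combined constraints are unsatisfiable.
func intersect(cs []interval) (interval, bool) {
	out := interval{lo: 0, hi: 1 << 30}
	for _, c := range cs {
		if c.lo > out.lo {
			out.lo = c.lo
		}
		if c.hi < out.hi {
			out.hi = c.hi
		}
	}
	return out, out.lo < out.hi
}

func main() {
	// root wants X >= 2, one dep wants X < 5, another wants X >= 3
	r, ok := intersect([]interval{{2, 1 << 30}, {0, 5}, {3, 1 << 30}})
	fmt.Println(r, ok) // {3 5} true
}
```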

I'm sorry if all this is very long and verbose; I want to avoid misunderstandings and create something that works well for Go projects and Go project users. Do tell me when I'm not on the right level of abstraction.

it smacks of the same sort of condescension with which SPM folks have sometimes treated LPM folks (i.e., like we're irresponsible amateurs) in the past. those conversations have generally not had productive results, and i'd like this to remain productive 😄

Hey, I didn't mean to disparage your knowledge in any way; you're way more competent than me in some domains (starting with Go). I'm just sick to death of being treated like an irresponsible amateur by project people who can't seem to imagine Linux distros may ever have learnt something about dependency management, despite successfully shipping tens of thousands of software projects for several decades³. I guess I was being defensive, sorry 😄. I'd rather discuss the how of go dep and rpm solver choices, and why we feel they are good engineering choices, in order to find a technical common ground. Thank you for your patience!

¹ In fact projects, once packaged by rpm distributions, are often more granular than the original project, because the original project often feels all its files are useful all the time, while project users may disagree.

² For example, if a unit of dependency specifies X in set Y, and another X in set Z, and Y and Z are disjoint, rpm will consider the constraints solvable if it finds something that provides X in set Y and another thing that provides X in set Z with no file conflict, and unsolvable if they both exist but have at least one file in common with different hashes. That's quite subtle and unintuitive, and one reason I worry so much about filesystem use conventions. That's also why rpm uses conflicts to limit the version set: X > Y + X < Z is not equivalent to Y < X < Z in some cases.

³ Usually because their measure of technical success is how much you charge for the result, and any death-march software project charges more than your average distribution.

@sdboyer
Member

sdboyer commented Oct 13, 2017

sorry, i will get back to this soon - just stupid-busy at the moment 😢

@nim-nim
Author

nim-nim commented Oct 19, 2017

NP, I perfectly understand; go dep is not the only thing needing work to make distributing Go software less painful.
