
How should SPIRE support non-unique node attestation? #558

Closed
ajessup opened this issue Aug 7, 2018 · 12 comments

Comments

@ajessup
Member

ajessup commented Aug 7, 2018

The current design of SPIRE assumes that each Agent is running on a node that can be identified uniquely (e.g. by a join token or an EC2 instance ID), even if a workload identified by an Agent may span multiple nodes.

In some cases, however, while properties of the infrastructure can be attested, there may be no verifiable way to uniquely identify the node an Agent is running on.

As an example, a SPIRE Agent deployed to ECS running on AWS Fargate will not be able to retrieve an instance ID (though it can retrieve an STS token that will verify the IAM roles associated with the service).

In theory, it should be possible to attest AWS workloads based on IAM roles encoded in STS tokens in the same way the aws-iid module attests workloads based on Instance Identity Documents. But the requirement that each node be uniquely identified prevents this.

The scope of this issue is to discuss motivating use cases and possible solutions to this problem in SPIRE.

cc // @mlakewood @evan2645 @grittershub

@grittershub

grittershub commented Aug 8, 2018

Use cases could be:

A containerised microservice with an IAM task role allocated (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html), running within an AWS ECS cluster, wishes to receive a SPIFFE ID to communicate with other microservices. Using the recommended ECS agent deployment model, containers do not have direct access to the underlying host's instance metadata service.

A containerised microservice with an IAM task role allocated (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html), running within an AWS ECS cluster in Fargate launch mode, wishes to receive a SPIFFE ID to communicate with other microservices. By design, Fargate-launched containers do not have access to the underlying host's instance metadata service, nor can any agent be installed on the underlying host.

A Lambda microservice with an IAM execution role allocated (https://docs.aws.amazon.com/lambda/latest/dg/intro-permission-model.html#lambda-intro-execution-role) wishes to receive a SPIFFE ID to communicate with other microservices. Lambda functions do not have access to any underlying host instance metadata service (by design).

@ajessup
Member Author

ajessup commented Oct 22, 2018

Some more detail on why this limitation exists in SPIRE today, and possible paths to solving it.

In the current design of SPIRE, when an agent performs node attestation it generates a CSR (which includes the SPIFFE ID that should identify the agent) and passes it to the server, along with the data used for attestation (such as the AWS IID). The SPIFFE ID is specified by the agent during the attestation process and needs to have the following properties:

  • Is unique to the instantiation of the agent (more on this below)
  • Can be proven with the data used for attestation. For example, if an AWS IID is used, the value must be verifiable in the IID.

An agent SPIFFE ID might be something like spiffe://prod.acme.com/agent/aws-iid/i-1db1512c8.

The reason that SPIRE currently requires that each instantiation of an agent is uniquely identifiable is to ensure that when SPIFFE certificates are renewed, only the agent that was issued a given certificate is able to exchange it for a new one.

There is a possible short term and a possible long term solution to this problem.

The possible short term solution is to design a new node attestor that is able to pass some unique per-instantiation value from its agent plugin to the server plugin during the attestation process. This could simply be a random number, generated once and persisted for that instantiation. The SPIFFE ID could then be reconstructed from that number. For example, spiffe://prod.acme.com/agent/aws-sts/123456 where 123456 is a randomly generated number or UUID.
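For illustration, here's a minimal sketch (in Go) of what the agent side of such an attestor might do. The file path and trust domain are made up, and the actual SPIRE plugin interfaces are omitted:

```go
// Hypothetical sketch only: generate a random per-instantiation value,
// persist it so the same agent re-attests with the same identity, and
// build the agent SPIFFE ID from it.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
)

const idFile = "/var/lib/spire/agent-instance-id" // assumed persistence path

// instanceID returns a random value generated once per agent instantiation.
func instanceID() (string, error) {
	if b, err := os.ReadFile(idFile); err == nil {
		return string(b), nil // reuse the value persisted at first boot
	}
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)
	if err := os.WriteFile(idFile, []byte(id), 0600); err != nil {
		return "", err
	}
	return id, nil
}

func main() {
	id, err := instanceID()
	if err != nil {
		panic(err)
	}
	// The agent would place this ID in the CSR it sends during attestation.
	fmt.Printf("spiffe://prod.acme.com/agent/aws-sts/%s\n", id)
}
```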

The possible longer term solution is to redesign the node attestation process entirely to avoid the need for SPIFFE IDs to be passed by the agent to the server in a CSR during attestation. This would then allow nodes to be assigned "random" SPIFFE IDs by the server if an instance can't determine a SPIFFE ID a priori.

In both solutions, the server-side attestor can emit the verified selectors to be used in an attestation policy, and these can be anything asserted by the identity document (such as an AWS IAM role from an STS token). These will then be used by SPIRE to determine which other identities the agent is entitled to issue - which could include the original workload ID (e.g. spiffe://prod.acme.com/billing/invoicer).

@adon-at-work

+1. Would love to hear updates on this development, if any.

@brianwolfe

This may be stale, but attestation based on the AWS IID has some limitations. Beyond the specific cases described here, where the IID and its derivatives may not be definitive for attesting a workload identity, the IID is not rotated as frequently as the IAM credential, increasing the potential for replay attacks or similar.

Has anyone considered using the Vault/aws-iam-authenticator approach for workload attestation instead of the IID, now that sts:GetCallerIdentity works? https://github.com/kubernetes-sigs/aws-iam-authenticator#how-does-it-work
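For anyone unfamiliar with that flow, here's a rough sketch of the client-side presigning step using aws-sdk-go v1 (not SPIRE code, just an illustration of the mechanism):

```go
// Sketch of the aws-iam-authenticator style flow: the client presigns an
// sts:GetCallerIdentity request with its IAM credentials and hands over
// only the URL. The verifier executes the URL and reads the caller's ARN
// from the response, without ever seeing the credentials themselves.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess := session.Must(session.NewSession())
	req, _ := sts.New(sess).GetCallerIdentityRequest(&sts.GetCallerIdentityInput{})

	// Presign the request; the URL embeds a signature over the STS call
	// and expires after the given duration, limiting the replay window.
	url, err := req.Presign(15 * time.Minute)
	if err != nil {
		panic(err)
	}

	// An agent would send this URL as attestation data; the server would
	// HTTP GET it and parse the GetCallerIdentity response to learn the
	// caller's account and IAM role ARN.
	fmt.Println(url)
}
```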

@bigdefect

bigdefect commented Aug 27, 2020

@ajessup To confirm my understanding, if two spire-agents were to attempt to connect to a spire-server identifying themselves as spiffe://prod.acme.com/agent/aws-iid/i-1db1512c8 as in your example, spire-server would reject one of them?

Adding onto the earlier mentions of ECS: even if you get past the expectation of requiring the containers to reach out to the EC2 IMDS, if multiple ECS tasks (each containing a spire-agent) are co-located on the same instance, those agents will then share the same IID, and one of them will be rejected.

Beyond that, multiple disparate workloads could end up on the same instance (ignoring complex placement strategies) so I think that means spire-server would need registrations of every combination of workload and instance. I'm new to SPIRE so tell me if I've made an error in either of the last two paragraphs.

I've been thinking about this for aws/aws-app-mesh-roadmap#68 where @evan2645 had chimed in months ago. I wondered if there could be some kind of composite attestation, where the EC2 IID is verifiable, and the task metadata is used for identification of a task; though the latter can't be verified by spire-server without a signature.

That aside, I like the suggestion by @brianwolfe to use a strategy similar to the IAM authenticator, which will work on Fargate too. Since a role is insufficient to identify a task, it would have to be used alongside either some other unique metadata per ECS task or the "short term solution" discussed above.

@ajessup
Member Author

ajessup commented Aug 28, 2020

I think there's some confusion here that my original wording of this issue likely contributed to. Just to clear things up: in the current design of SPIRE, agents must be assigned to a single uniquely identified node, but workloads do not. It is quite possible to register a workload in SPIRE with a policy that allows it to be assigned to any node that matches a certain set of conditions (such as a particular AWS tag). This is, as you point out, necessary when workloads are running on a container scheduler like ECS or Kubernetes, since a workload may well be scheduled on any or several of a large number of nodes at any given time. More on this below.

Generally this isn't a problem if you're operating in an environment where an agent can run on each node and has a way to prove its unique identity to the SPIRE server. Where this becomes problematic is when there is no "node" as such (e.g. Fargate, Lambda) and the entity that is acting as the agent can't uniquely identify a single instantiation of itself. This is due to a set of assumptions in the API that SPIRE Agents use to authenticate to SPIRE Servers (the Node API).

The SPIRE project has been busy refactoring the SPIRE APIs to address this, as well as several other limitations (see #1057 for the latest). These APIs should significantly simplify "agentless" SPIRE deployments, but I'll let @azdagron keep me honest there.

@ajessup To confirm my understanding, if two spire-agents were to attempt to connect to a spire-server identifying themselves as spiffe://prod.acme.com/agent/aws-iid/i-1db1512c8 as in your example, spire-server would reject one of them?

Adding onto the earlier mentions of ECS: even if you get past the expectation of requiring the containers to reach out to the EC2 IMDS, if multiple ECS tasks (each containing a spire-agent) are co-located on the same instance, those agents will then share the same IID, and one of them will be rejected.

Note that the expectation here is that a SPIRE Agent is per-node (or more formally, per-kernel), and can service multiple workloads running on that node. In Linux this is generally accomplished with a system daemon, and in Kubernetes with a DaemonSet resource to control placement (I don't know if ECS supports similar "Guarantee N copies per node" placement rules).

Individual workloads then call the Workload API exposed by the Agent over a unix domain socket (in Kubernetes, this is mounted into each pod). The agent then inspects the PID and other metadata of the calling process to determine exactly which container is calling it.
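For concreteness, this is roughly what the workload's side of that call looks like with the go-spiffe v2 library (the socket path is an assumption and depends on how the agent is deployed):

```go
// Minimal sketch of a workload consuming the Agent's Workload API over a
// unix domain socket. The agent attests the calling process and returns
// only the SVIDs that process is entitled to.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Assumed socket path; in Kubernetes it is typically mounted into the pod.
	svid, err := workloadapi.FetchX509SVID(ctx,
		workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock"))
	if err != nil {
		panic(err)
	}
	fmt.Println("got SVID for", svid.ID)
}
```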

Beyond that, multiple disparate workloads could end up on the same instance (ignoring complex placement strategies) so I think that means spire-server would need registrations of every combination of workload and instance. I'm new to SPIRE so tell me if I've made an error in either of the last two paragraphs.

If you're able to get an agent on each node (if you're running on EC2, it might be feasible to have it start as part of the base VM image, for example) and there's some provable element of each node (see here for the complete list of selectors you can use: https://github.com/spiffe/spire/blob/master/doc/plugin_server_nodeattestor_aws_iid.md), then you can define your workload in terms of a node selector (say, the node must have a particular instance tag) as well as one or more workload-specific selectors (say, a docker image name).
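For example, the registration might look something like this with the spire-server CLI (the SPIFFE IDs, tag, and label here are made up; the selector names follow the documented aws_iid node attestor and docker workload attestor):

```sh
# Alias every node carrying a particular EC2 instance tag (hypothetical IDs).
spire-server entry create -node \
    -spiffeID spiffe://prod.acme.com/billing-nodes \
    -selector aws_iid:tag:team:billing

# Register the workload against that alias plus a workload-specific
# selector, so it can be issued an SVID on any matching node.
spire-server entry create \
    -parentID spiffe://prod.acme.com/billing-nodes \
    -spiffeID spiffe://prod.acme.com/billing/invoicer \
    -selector docker:label:com.acme.app:invoicer
```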

I've been thinking about this for aws/aws-app-mesh-roadmap#68 where @evan2645 had chimed in months ago. I wondered if there could be some kind of composite attestation, where the EC2 IID is verifiable, and the task metadata is used for identification of a task; though the latter can't be verified by spire-server without a signature.

Without knowing the intricacies of ECS, I wonder if it would be possible to write a SPIRE Agent plugin that is an ECS equivalent of https://github.com/spiffe/spire/blob/master/doc/plugin_agent_workloadattestor_k8s.md?

The way that plugin works is to map the metadata that the SPIRE Agent retrieves from the workload via the unix domain socket (UID, PID, etc.) to Kubernetes metadata (e.g. k8s namespace or pod label) retrieved from the kubelet process running on the same node, and expose these attributes as selectors in registration policies. I wonder if there's a similar node local component in ECS that understands task metadata and can map that to unix primitives?
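For reference, a rough sketch of the Kubernetes version of that node-local mapping: the attestor reads the calling PID's cgroup file and pulls out the pod UID and container ID, which can then be looked up against the kubelet. The cgroup pattern matched below is an assumption; real layouts vary by runtime and cgroup driver.

```go
// Hedged sketch of PID -> pod/container resolution via /proc/<pid>/cgroup.
package main

import (
	"fmt"
	"os"
	"regexp"
)

// Matches one common cgroup layout: ...kubepods...pod<36-char UID>...<64-hex container ID>.
var podRe = regexp.MustCompile(`kubepods.*pod([0-9a-f-]{36}).*?([0-9a-f]{64})`)

func podInfoForPID(pid int) (podUID, containerID string, err error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", "", err
	}
	m := podRe.FindSubmatch(data)
	if m == nil {
		return "", "", fmt.Errorf("pid %d not in a kubepods cgroup", pid)
	}
	// The pod UID and container ID can then be checked against the kubelet's
	// pod list to recover namespace, labels, service account, etc.
	return string(m[1]), string(m[2]), nil
}

func main() {
	pod, ctr, err := podInfoForPID(os.Getpid())
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("pod:", pod, "container:", ctr)
}
```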

That aside, I like the suggestion by @brianwolfe to use a strategy similar to the IAM authenticator, that'll work on fargate too. Since a role is insufficient to identify a task, It would have to be used alongside either some other unique metadata per ECS task or the "short term solution" discussed above.

I suspect once the refactored APIs are in place, an STS-based attestor will become very attractive.

@bigdefect

Thanks Andrew. I think that largely aligns with my current understanding. I probably contributed to some confusion; I didn't mean to suggest that workloads are uniquely identifiable, but I was hung up on the idea of registering every combination of unique agents and workloads so that any workload task could be placed anywhere.

I don't know how people are actually using SPIRE, so it's hard for me to form an intuition on workload selectors. But based on how the k8s registrar works, it seems like the expectation is that workloads can be registered as they come up, and as long as the node and workload attestations succeed, you're good.

I don't know if ECS supports similar "Guarantee N copies per node" placement rules

ECS does have a daemon mode for ECS services, but it's intended for things like log exporters rather than runtime dependencies (or however you would categorize them). Customers do control their instances when using ECS, and some do manage their own images and daemons, but that drastically raises the barrier to entry, in addition to introducing relationships that are complicated or impossible to manage with ECS APIs.

My mind had gone straight to treating each task as a node and having a spire-agent container alongside each workload. Then customers can model it like any other container and set up order and such. But that's what brought me to the concerns around attesting even a node in that environment, let alone the workload.

I wonder if there's a similar node local component in ECS that understands task metadata and can map that to unix primitives?

Task metadata does provide a whole bunch of information on both the task and the node. But it doesn't map to unix primitives.

As I mentioned above, spire-server can verify an IID, but there's no equivalent for task metadata. It's certainly possible for:

  1. spire-agent to use its task metadata as node attestation data
  2. spire-server to reach out to the appropriate ECS API to describe the tasks running on an instance, and verify that the attested task is indeed present

But, in order for that to not be spoofed, presumably you'd need some complex logic to sign the task payload and verify it on spire-server. I haven't thought about it much, but I'm happy to be proven wrong.

Of course, the "right" solution is one that works across ECS and Fargate, likely relying on a combination of STS and task metadata. The session name for task roles does at least contain the task ID.
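To make that concrete, a sketch of what such a server-side check could look like with aws-sdk-go v1 (the cluster name, ARN layout, and task ID below are all made up for illustration):

```go
// Hedged sketch: take the caller ARN proven via STS, pull the task ID out
// of the assumed-role session name, and confirm with the ECS API that such
// a task is actually running.
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// taskIDFromARN extracts the session name from an assumed-role ARN like
// arn:aws:sts::123456789012:assumed-role/invoicer-task-role/<task-id>.
func taskIDFromARN(arn string) (string, error) {
	if !strings.Contains(arn, ":assumed-role/") {
		return "", fmt.Errorf("not an assumed-role ARN: %s", arn)
	}
	parts := strings.Split(arn, "/")
	return parts[len(parts)-1], nil
}

func main() {
	// In practice this ARN would come from a verified GetCallerIdentity call.
	callerARN := "arn:aws:sts::123456789012:assumed-role/invoicer-task-role/0f9de6f8c5f14f6d"

	taskID, err := taskIDFromARN(callerARN)
	if err != nil {
		panic(err)
	}

	// Confirm the task exists and is running in the expected cluster.
	out, err := ecs.New(session.Must(session.NewSession())).DescribeTasks(&ecs.DescribeTasksInput{
		Cluster: aws.String("prod"), // assumed cluster name
		Tasks:   []*string{aws.String(taskID)},
	})
	if err != nil {
		panic(err)
	}
	for _, t := range out.Tasks {
		fmt.Println("verified task:", aws.StringValue(t.TaskArn), "status:", aws.StringValue(t.LastStatus))
	}
}
```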

@ajessup
Member Author

ajessup commented Aug 30, 2020 via email

@bigdefect

Appreciate the thoughtful response.

Broadly, yes. SPIRE requires each SPIFFE ID be registered before it is identified. The ID is specific, but the selectors can be generic.

I was going down the thought process of whether the agent could have a mode where the user is guaranteeing that it will be co-located exclusively with one workload (as you would have in an ECS task). If we accept the threat boundary, then it can just vend the SVID to whichever workload asks for it.

^ This is setting aside the benefits of a persistent agent you've mentioned above, and ignoring the overhead of an agent just for SVID retrieval. It leads naturally into the "agentless" discussion.

So if there's an agent and a node-local API to the local ECS scheduler, and that API allows for mapping kernel metadata (like PID) to ECS metadata (like Task ID) then this should be all you need.

The ecs-agent that manages the containers may have that information available (though it wouldn't make much sense to integrate with it). Kernel metadata isn't mapped to task metadata, but docker metadata is. It may be doable to attest the node using EC2 IIDs, and workloads using docker labels (which include task info).

That being said, an architecture where containers have hard dependencies on software running on the instance isn't something we'd recommend.

Beyond STS, is there any kind of signed identity document issued by ECS to a workload (an equivalent of Kubernetes' Service Account Tokens, for example) that could be used to prove properties of a task that a SPIRE Server could independently verify?

As far as I know, there isn't an equivalent in ECS today. I'm starting to reach out to folks in the know to see how this might be solved, but at least we have some expertise from solving a similar problem in Kubernetes.

The other option I'm thinking about is using something like a private key stored in Secrets Manager, which ECS can provide to a container. That's more overhead to set up, but tractable.
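A minimal sketch of that option with aws-sdk-go v1 (the secret name is hypothetical, and the signing protocol around the key would still need to be designed):

```go
// Hedged sketch: the container's task role is granted access to a
// pre-provisioned private key in Secrets Manager, which the workload
// fetches at startup to bootstrap its identity.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/secretsmanager"
)

func main() {
	sm := secretsmanager.New(session.Must(session.NewSession()))
	out, err := sm.GetSecretValue(&secretsmanager.GetSecretValueInput{
		SecretId: aws.String("invoicer/bootstrap-key"), // hypothetical secret
	})
	if err != nil {
		panic(err)
	}
	// The PEM-encoded key could then be used to sign a bootstrap request
	// that the server verifies against the corresponding public key.
	fmt.Println("fetched", len(aws.StringValue(out.SecretString)), "bytes of key material")
}
```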

@bigdefect

@ajessup A follow-up as I re-read this:

The reason that SPIRE currently requires that each instantiation of an agent is uniquely identifiable is to ensure that when SPIFFE certificates are renewed, only the agent that was issued a given certificate is able to exchange it for a new one.

Is this a delivery optimization (so multiple agents aren't caching the same SVIDs)? Or is it a security concern (something like, every instance of a workload (even if those workloads are homogeneous, like replicas of a service) should have its own SVID)?

@dfeldman
Member

dfeldman commented Sep 3, 2020

Is this a delivery optimization (so multiple agents aren't caching the same SVIDs)? Or is it a security concern (something like, every instance of a workload (even if those workloads are homogeneous, like replicas of a service) should have its own SVID)?

It is a security concern, but just to reduce the probability of an attacker stealing and replaying certificates. I believe there is currently nothing stopping multiple instances of the same workload from getting identical SVIDs if they have identical selectors.

The other option I'm thinking about is using something like a private key stored in Secrets Manager, which ECS can provide to a container. That's more overhead to set up, but tractable.

This is the approach used in the Square implementation of SPIFFE for lambda functions. https://developer.squareup.com/blog/providing-mtls-identities-to-lambdas/

@azdagron
Member

azdagron commented May 6, 2022

Hi @ajessup! We haven't seen movement on this issue for a while. Is this discussion still relevant given the direction we've headed for serverless architecture support? If there are still specific integrations in mind, I'd suggest opening a new issue to discuss each specific integration so that we can scope it independently.

I'll go ahead and close this for now, pending any new discussion.

@azdagron azdagron closed this as completed May 6, 2022