x86: Shadow stacks compatibility for interceptor #791

yjugl · 2024-04-03T16:31:49Z

This patch makes interceptor code compatible with Intel CET shadow stacks.

Make the on_leave_trampoline a legit address to return to from a CET-enabled intercepted function's point of view.
Ensure a proper call/ret discipline to not interfere with shadow stacks.

On Windows, this means that it will now be possible to intercept functions that live in CETCOMPAT modules when the user shadow stack mitigation is active (though not in strict mode).

yjugl · 2024-04-04T12:41:39Z

I have tests failing locally and I'm taking a look. It seems that INCSSPQ yields an illegal instruction if we're on a CPU that supports shadow stacks but we are not currently using shadow stacks, so I would have to check IA32_U_CET.SH_STK_EN before using that instruction. Edit: Oh, this code path is broken anyway. I'm not popping on_leave_trampoline but the current return address. This needs a rewrite.

yjugl · 2024-04-05T11:36:56Z

The current patch works on my machine (which has CET shadow stacks support) and still lets me intercept functions from CETCOMPAT modules in processes that use the user shadow stack mitigation on Windows. It is now ready for review.

oleavr

Yay, this is awesome! 🤩 Thanks a lot for doing this.

Since it took me so long to review due to a rabbit-hole that kept me busy for way too long, I went ahead and made the final tweaks needed to merge this (only a few minor tweaks). While doing that I noticed the CI acting up, so I'm going to get to the bottom of that and remove those debug commits before merging.

This makes Interceptor code compatible with Intel CET shadow stacks. - Detect CPUs that have support for CET shadow stacks. - Make the on_leave_trampoline a legit address to return to from a CET-enabled intercepted function's point of view. - Ensure a proper CALL/RET discipline. On Windows, this means that it will now be possible to intercept functions that live in CETCOMPAT modules when the user shadow stack mitigation is active (though not in strict mode).

yjugl · 2024-05-13T09:26:14Z

Yay, this is awesome! 🤩 Thanks a lot for doing this.

Since it took me so long to review due to a rabbit-hole that kept me busy for way too long, I went ahead and made the final tweaks needed to merge this (only a few minor tweaks). While doing that I noticed the CI acting up, so I'm going to get to the bottom of that and remove those debug commits before merging.

Hey, you're welcome! Thanks for reviewing and integrating the patch. Let me know if some points are still unclear. Actually there are two points I would like to emphasize here.

First, I took a somewhat arbitrary design decision. The question is essentially, do we want all users to run the same code always, although that includes instructions that are useless if the mitigation is not active? There are essentially three possibilities here:

(a) ensure proper call/ret discipline for all x86 users always ;
(b) ensure proper call/ret discipline for all x86 users for which we detect that the CPU supports shadow stacks, even if the current process is not making active use of the mitigation ;
(c) ensure proper call/ret discipline only if the current process makes active use of the mitigation.

I would say (a) is the best long-term solution from a maintenance and stability point of view because it guarantees that the code that users run takes exactly the same code paths that were run during tests no matter what. (c) would be the exact opposite, only running the extra instructions when necessary, at the cost that you may discover subtle bugs at some point because the tests ran without shadow stack support and the user runs Frida in a process that has them, or vice versa. (c) also requires support from the operating system to know if the mitigation is currently active, so it is a bit more work on the portability and maintenance side.

The current patch implements (b) which is some kind of a compromise based on the idea that if your CPU is recent enough to support shadow stacks, it's probably also fast enough to run the extra instructions without any significant performance cost and as time passes and people change their CPUs, (b) gets closer to (a) anyway. I would probably have gone for (a) if I owned the library, because I don't think these extra instructions will actually lead to a significant performance regression for anyone, but I went for (b) because I can't guarantee that. If you think (a) or (c) matches better with the goals for Frida, we could adapt the patch.

Second, we're replacing a ret with a lea rsp, [rsp+8]; jmp [rsp-8] in gum_emit_epilog under GUM_POINT_ENTER. I think that should be fine, but if you encounter subtle issues with some tools or processes, that would be something to double check. For example it would not be fine to do that in kernel mode, where an interrupt between the two instructions could end up replacing the value at rsp-8 without restoring it. It may be the case that some userland tools like profilers would also assume that the stack in not used at negative offsets from rsp; or that some programs would rely on that assumption for themselves when they suspend their own threads; I'm not 100% sure. We may want to use a pop rax; jmp rax instead but I wasn't sure if any register could be used 100% reliably here either, as we could be jumping to a function that has a weird calling convention. So I didn't really find a more satisfying solution.

oleavr · 2024-05-13T12:59:18Z

Hey, you're welcome! Thanks for reviewing and integrating the patch.

🙌

Let me know if some points are still unclear. Actually there are two points I would like to emphasize here.

First, I took a somewhat arbitrary design decision. The question is essentially, do we want all users to run the same code always, although that includes instructions that are useless if the mitigation is not active? There are essentially three possibilities here:
(...)

I totally agree that a) is the ideal from a maintenance and stability point of view. But I think we should strive to reduce our overhead over time, so we can minimize our impact on timings, to be able to observe a system without changing its behavior more than necessary. So I think I would lean towards c). Given that Interceptor is used for a lot of things and these trampolines are common to all use-cases, it seems worth it to trade some simplicity for speed.

Second, we're replacing a ret with a lea rsp, [rsp+8]; jmp [rsp-8] in gum_emit_epilog under GUM_POINT_ENTER. I think that should be fine, but if you encounter subtle issues with some tools or processes, that would be something to double check. For example it would not be fine to do that in kernel mode, where an interrupt between the two instructions could end up replacing the value at rsp-8 without restoring it. It may be the case that some userland tools like profilers would also assume that the stack in not used at negative offsets from rsp; or that some programs would rely on that assumption for themselves when they suspend their own threads; I'm not 100% sure. We may want to use a pop rax; jmp rax instead but I wasn't sure if any register could be used 100% reliably here either, as we could be jumping to a function that has a weird calling convention. So I didn't really find a more satisfying solution.

Good point! I think this is fine for now. We currently ensure that we don't clobber any registers or CPU flags on x86, and I think it is important to retain this guarantee as Interceptor does indeed support instruction-level probes as well as targets with custom calling conventions.

yjugl · 2024-05-13T13:54:37Z

I totally agree that a) is the ideal from a maintenance and stability point of view. But I think we should strive to reduce our overhead over time, so we can minimize our impact on timings, to be able to observe a system without changing its behavior more than necessary. So I think I would lean towards c). Given that Interceptor is used for a lot of things and these trampolines are common to all use-cases, it seems worth it to trade some simplicity for speed.

In that case, let me expand a bit on (c). Another issue with (c) is the following: if a process is currently not using the mitigation, could the mitigation become active later in that same process? Because if that is possible, then (c) is dangerous: we could be generating code now that would not be correct anymore at a later point in the process.

My testing so far suggests that on Windows at least, either a process has the mitigation from the start, or it will never have it. That can be specified at process creation using UpdateProcThreadAttribute with PROCESS_CREATION_MITIGATION_POLICY2_CET_USER_SHADOW_STACKS_*. It does not seem possible to opt-in for the mitigation during the life of the process, which could use SetProcessMitigationPolicy with ProcessUserShadowStackPolicy but that gets rejected. It seems only GetProcessMitigationPolicy would be valid for that policy. If that is correct, then (c) could make sense on Windows. I don't know if that generalizes to other platforms however.

oleavr · 2024-05-13T21:15:44Z

In that case, let me expand a bit on (c). Another issue with (c) is the following: if a process is currently not using the mitigation, could the mitigation become active later in that same process?

Good point! It seems unlikely, but I don't know either.

If that is correct, then (c) could make sense on Windows. I don't know if that generalizes to other platforms however.

Cool! I suppose we could do that for Windows only initially, and expand on it later as we learn more about the other platforms' internals.

yjugl force-pushed the x86-interceptor-shadow-stacks branch from 666e10e to 8147f85 Compare April 5, 2024 11:30

yjugl added 2 commits May 12, 2024 14:09

cpu-features: Add CET_SS flag and detection logic

a1bd597

x86-writer: Add cpu_features field

65aa675

oleavr force-pushed the x86-interceptor-shadow-stacks branch 2 times, most recently from 31b7474 to ba13090 Compare May 12, 2024 12:18

yjugl requested a review from oleavr as a code owner May 12, 2024 17:01

oleavr approved these changes May 12, 2024

View reviewed changes

oleavr force-pushed the x86-interceptor-shadow-stacks branch from fe86f1f to aceb3ae Compare May 13, 2024 09:15

oleavr merged commit a28979e into frida:main May 13, 2024
31 of 32 checks passed

yjugl mentioned this pull request Jun 9, 2024

cpu-features: Ignore CET SS unless actively used #804

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x86: Shadow stacks compatibility for interceptor #791

x86: Shadow stacks compatibility for interceptor #791

yjugl commented Apr 3, 2024

yjugl commented Apr 4, 2024 •

edited

Loading

yjugl commented Apr 5, 2024

oleavr left a comment •

edited

Loading

yjugl commented May 13, 2024 •

edited

Loading

oleavr commented May 13, 2024

yjugl commented May 13, 2024 •

edited

Loading

oleavr commented May 13, 2024

x86: Shadow stacks compatibility for interceptor #791

x86: Shadow stacks compatibility for interceptor #791

Conversation

yjugl commented Apr 3, 2024

yjugl commented Apr 4, 2024 • edited Loading

yjugl commented Apr 5, 2024

oleavr left a comment • edited Loading

Choose a reason for hiding this comment

yjugl commented May 13, 2024 • edited Loading

oleavr commented May 13, 2024

yjugl commented May 13, 2024 • edited Loading

oleavr commented May 13, 2024

yjugl commented Apr 4, 2024 •

edited

Loading

oleavr left a comment •

edited

Loading

yjugl commented May 13, 2024 •

edited

Loading

yjugl commented May 13, 2024 •

edited

Loading