[LibOS,PAL/Linux-SGX] EDMM: Introduce lazy allocation #1099
Comments
@vijaydhanraj You're working on this, right? Let me assign you now, and if this is wrong, please reply and I'll remove the assignment. UPDATE: I forgot that I can't assign non-maintainers. @vijaydhanraj Please write something in this issue, so that GitHub allows me to assign you. |
Hi @dimakuv, please assign this to me. |
I'm not quite familiar w/ the context and was just looking into this requirement. I have the following assumptions and would like to have more discussions/clarifications.
Please correct me if anything is incorrect. Thanks! |
@kailun-qin The proposed changes and flow look correct to me. Below are some notes.
Actually, sorry, the more I think about
I don't like the new manifest option. There are already far too many options, and the "lazy acceptance" optimization shouldn't be controversial -- it seems to be always beneficial. I don't think we need to require |
@dimakuv Thanks for the feedback! Please see my comments below.
Yeah, makes sense to me.
Yes, I agree.
But LibOS cannot tell whether the zero faulting address is actually due to the app itself or the
To complete the proposal, there are two other points that're not covered yet:
Any thoughts? Thanks! |
I don't understand your point. If If our current LibOS #PF handler doesn't check
I am not sure what you mean by this. Do you mean that file-backed mappings can also be with
Yes, here I see the problem. I vote against both suggestions:
Can we use any additional interfaces of the SGX driver API, or can we use some properties of the SGX instructions to "get information" about the state of the to-be-freed enclave pages? E.g., maybe |
Ah OK, I don't think we check it and we should probably check this in gramine/pal/src/host/linux-sgx/pal_exception.c (lines 323 to 325 at commit 3b5c88b).
Yes, this was what I meant.
I agree that rare applications use file-backed mappings with
I don't think
What about we try to pre-fault all to-be-freed pages at the very beginning of
EACCEPT (using initial page permissions), regardless of whether they've been accepted or not.
If it's already committed, then we get I think this should introduce similar overhead if there is an SGX driver API / SGX instruction that we can rely on to "get information" about the state and handle accordingly, because this approach basically leverages |
Yep. Will you create such a PR?
Yes, let's not cover this case. Just put a FIXME comment in the code, that we currently ignore this case because we consider it rare and not worth optimizing for performance.
Yes, this approach looks ok-ish. I'm afraid we won't come up with anything better than this, at least not currently.
I think the overhead will be significant, because for such I was hoping for an overhead of Also, maybe this is a good point to ask SGX driver developers whether they can suggest a better solution, or even introduce a new IOCTL or something specially for us? |
I created #1502 for this.
Yes - for the pages that're not faulted/accessed at all. While for those that're allocated lazily (though can be very few w/
Right, exactly.
Sure, I'll approach Haitao et al. to see if there is any better option. |
I have two new notes:
All #PF handling must be inside the SGX PAL
The original design by Kailun uses the LibOS's memory-fault handler. I overlooked this design choice, but now I'm certain that this is wrong. LibOS is arch-agnostic and must not even know that things like (minor) page faults exist. Also, calling
So we actually need to intercept minor page faults in the SGX PAL:
This is also correct from the other PAL's view: the Linux PAL never generates such minor page faults (because they are done completely by the underlying Linux host). Thus, the proposed
We must introduce a bitmap vector
This bitmap vector will simply span the whole
I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages). The real bulk of the memory bookkeeping is still on the LibOS VMA subsystem, including page permissions. Unfortunately, there seems to be no way to have an implementation of this lazy EDMM feature with a completely stateless code... |
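A minimal sketch of what such a per-page bitmap could look like, assuming one bit per 4 KiB enclave page and a caller-provided backing buffer. All names here (`commit_bitmap`, `bitmap_set_range`, `handle_minor_fault`) are hypothetical and not taken from Gramine's code; a real handler would issue the actual EAUG/EACCEPT flow where the sketch only flips a bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed enclave page size */

/* Hypothetical tracker: one bit per enclave page, set once the page is accepted. */
struct commit_bitmap {
    uintptr_t base;      /* page-aligned start of the tracked range */
    size_t    num_pages; /* number of pages covered by the bitmap */
    uint8_t*  bits;      /* caller-provided, (num_pages + 7) / 8 bytes */
};

static void bitmap_init(struct commit_bitmap* bm, uintptr_t base, size_t num_pages,
                        uint8_t* storage) {
    bm->base      = base;
    bm->num_pages = num_pages;
    bm->bits      = storage;
    memset(storage, 0, (num_pages + 7) / 8); /* nothing is committed initially */
}

static bool bitmap_is_committed(const struct commit_bitmap* bm, uintptr_t addr) {
    size_t idx = (addr - bm->base) / SKETCH_PAGE_SIZE;
    assert(idx < bm->num_pages);
    return (bm->bits[idx / 8] >> (idx % 8)) & 1u;
}

/* Must be called on every real add/remove of enclave pages, not only on #PF. */
static void bitmap_set_range(struct commit_bitmap* bm, uintptr_t addr, size_t count,
                             bool committed) {
    size_t first = (addr - bm->base) / SKETCH_PAGE_SIZE;
    assert(first + count <= bm->num_pages);
    for (size_t i = first; i < first + count; i++) {
        if (committed)
            bm->bits[i / 8] |= (uint8_t)(1u << (i % 8));
        else
            bm->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
    }
}

/* Returns true if the fault hit a tracked, not-yet-committed page and was handled
 * here; false means the fault is not a lazy-allocation fault and must be forwarded. */
static bool handle_minor_fault(struct commit_bitmap* bm, uintptr_t fault_addr) {
    uintptr_t page = fault_addr & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
    if (page < bm->base || (page - bm->base) / SKETCH_PAGE_SIZE >= bm->num_pages)
        return false; /* outside the tracked range */
    if (bitmap_is_committed(bm, page))
        return false; /* page already exists, so this is a genuine fault */
    /* A real implementation would EAUG + EACCEPT the page here (assumption). */
    bitmap_set_range(bm, page, 1, true);
    return true;
}
```

The sketch keeps the bitmap deliberately dumb: it records only committed/not-committed per page, leaving permissions and all other bookkeeping to the VMA subsystem, as the note above suggests.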
The way I see it, the EDMM sub-component tracks EPCM attributes, which are SGX-specific and should not be in conflict with regular VMA attributes. The EACCEPT bitmap is one of them; even EPCM.R/W/X could be out of sync with the LibOS VMA record. I'm also not sure that reusing standard mmap flags to signal whether a page is EAUG'ed on #PF is reasonable, because you may have situations where EAUG on #PF is also needed for other flags. But I'm not deep into Gramine use cases; this is just something to consider. The Intel SDK and the Microsoft Open Enclave SDK (and probably other runtimes too, given that people are sending PRs and issues) use the sgx-emm implementation, which has separate tracking for all of those, and we didn't find it to be much overhead. You can find the rationale for storing EPC states here: https://github.com/intel/sgx-emm/blob/main/design_docs/SGX_EDMM_driver_interface.md#enclave-handling-of-faults. |
@haitaohuang Thanks for your inputs! Some comments below.
Why would the EPCM attributes be out of sync from the LibOS VMA page permissions? I see no real-world scenario when this can happen/is beneficial.
I think you're confusing Gramine's purpose here. We are not reusing the flags for our own purposes, instead we want to emulate the lazy-allocation behavior of the Linux x86 kernel. One of the simple cases is this |
hmm... But shouldn't we also update this bitmap vector every time we add/remove enclave pages? Otherwise, during #PF handling, how can we know whether an enclave page X was actually eaccepted (so that we can add it if it was not)? |
Yes, of course, sorry for the confusion. The bitmap must be updated on actual add/remove of enclave pages. What I was trying to say is that the only rationale for introducing this bitmap is to track lazy allocation of enclave pages via the #PF exceptions. |
This of course depends on the use case and may not be applicable to Gramine. In the multithreaded case, one thread may change permissions and record the target permissions in the VMA while the EPCM has not been changed yet. Say you originally have RW in both the VMA and the EPCM. After mprotect changes the VMA to RX, but before EMODPR, EMODPE and EACCEPT are done to finish changing the EPCM, another thread may come in and execute the code. In this window, EPCM=RW but VMA=RX, so a #PF may happen. To handle the #PF, you need to track the EPCM.
Yeah, I misspoke when I said "reuse" those flags. So for MAP_NORESERVE, and similarly for other flags, it will always be EAUG on #PF. Once you decide this is the way to go, I wonder whether you would later use other criteria to decide, e.g. the size of the area, or special situations like stack/heap where you may commit a portion on demand. If you plan to support those, then in the end some kind of flags will be needed in the PAL to track which ranges are on-demand and which are eagerly committed, along with a more explicit indicator passed to the PAL for the mode of EPC committing. BTW, the Linux kernel does not seem to do much for MAP_NORESERVE other than not reserving/accounting for swap. IIUC, it makes little difference in terms of whether RAM is committed or not. "MAP_POPULATE does eager allocation, otherwise do lazy" seems to be a better heuristic. Not sure if it was considered. |
@haitaohuang Thanks again for more insights!
This race should be impossible in normal execution of Gramine. Gramine's LibOS VMA subsystem synchronizes mprotect requests internally. And if the application itself does it (one app thread performs mprotect, and another app thread accesses this same page), then it is a bug in the application, and Gramine is not supposed to "try to fix" bad behavior of the app.
Yes. After some more discussions with @kailun-qin, we currently think that we'll get away with the following metadata:
Note that we also don't want to change Gramine policy to the Linux one (always postpone committing pages, unless instructed otherwise). This would introduce tremendous performance overhead, due to the additional #PF flow, which is very expensive in SGX. That's why in Gramine, we kinda have a reverse logic -- we try to find the flags that hint at "this memory range will probably never be needed". One good hinting flag that we observed in several workloads (most notably in Java) is |
Description of the feature
PR #1054 implemented an initial version of EDMM support.
In particular, every mmap(addr, size) request ends up allocating the range of enclave pages [addr, addr+size), via the call to pal_memory.c: sgx_edmm_add_pages(), which in turn does a loop of sgx_eaccept + restrict/expand permissions over all pages in the range. In other words, the whole mmapped region is pre-accepted.
In some cases, applications may rely on lazy allocation of pages, where the VMAs are reserved but not actually committed to physical memory. In particular, mmap(..., MAP_NORESERVE) requests are used in such cases -- to mmap a huge chunk of memory (possibly never used in the future) at once and then commit pages on demand on page fault events.
So, our initial implementation of EDMM support doesn't have this concept of lazy allocation. Ideally, we would pre-accept the mmapped range only on some mmap() requests, and defer page accepts to page-fault events on other mmap() requests. One obvious heuristic to defer page accepts is when Gramine notices the MAP_NORESERVE flag in the mmap request.
Why should Gramine implement it?
Performance reasons, as well as to decrease the amount of required EPC (physical SGX memory). E.g., the Java runtime may issue mmap(64GB, MAP_NORESERVE) -- the current EDMM implementation will spend a lot of time and physical memory on allocating + accepting all 64GB of enclave pages. The improved implementation would not allocate these 64GB of enclave pages at all (only the actually required subset of pages will be allocated + accepted during page fault handling).
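To make the application-side pattern concrete, here is a small sketch in plain Linux C (not Gramine code) of reserving a large range with MAP_NORESERVE and touching only a few pages, together with the heuristic described above expressed as a predicate. The helper names `should_defer_accept`, `reserve_lazily` and `touch_pages` are hypothetical and exist only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical policy: defer page accepts only when the app hints that the
 * range may never be fully used. */
static bool should_defer_accept(int mmap_flags) {
    return (mmap_flags & MAP_NORESERVE) != 0;
}

/* Reserve `size` bytes of address space without committing memory up front.
 * Returns NULL on failure. */
static void* reserve_lazily(size_t size) {
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write one byte into each of the first `n` pages; on Linux every first touch
 * triggers a page fault that commits just that page. Returns pages touched. */
static size_t touch_pages(void* base, size_t page_size, size_t n) {
    volatile char* bytes = base;
    for (size_t i = 0; i < n; i++)
        bytes[i * page_size] = 1;
    return n;
}
```

On a native Linux host, reserving 1 GiB this way and touching four pages commits only those four pages; the request is for the EDMM path to behave analogously, accepting enclave pages only when they are first accessed.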