Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush profile data on demand using SIGUSR1 signal #1751

Merged
merged 1 commit into from
Jun 26, 2024

Conversation

jkr0103
Copy link
Contributor

@jkr0103 jkr0103 commented Feb 5, 2024

Description of the changes

Currently Gramine don't support flushing sgx profiling data if user wants to collect only for a particular duraion (e.g., only during perf benchmarking runs). This PR add feature to flush the profile data on demand using SIGUSR1 signal. That way user can collect data only for a particular duration. This also eliminates the noise added in profile data duing start and stop of the process.

Partially fixes issue #1711

How to test this PR?

Please follow the instruction here to enable SGX profiling. Now execute the workload in Gramine SGX. Send SIGUSR1 signal using command kill -SIGUSR1 <pid> to flush the profile data. This should create a sgx profile data file which can be opened using perf report command.

Example:

  1. Add below manifest options to gramine/CI-Examples/redis/redis-server.manifest.template.
    sgx.profile.enable = "main"
    sgx.profile.with_stack = true
  2. Build the redis example: make clean && make SGX=1
  3. Run redis server: gramine-sgx redis-server &
  4. Get the redis server pid and run kill -SIGUSR1 <pid> to flush. Gramine flush the profile data in a file as indicated by below log line:
    Profile data written to sgx-perf-1707196865.data (161042 bytes)
  5. Run the benchmarks using command src/src/redis-benchmark
  6. Once benchmark run is complete issue the send SIGUSR1 signal to flush profile data using command kill -SIGUSR1 <pid>. Again profile data is written to a new file as indicated by below log line:
    Profile data written to sgx-perf-1707196952.data (15096242 bytes)
  7. sgx-perf-1707196952.data have the profile data collected during benchmarks run.
  8. Open the profile data: perf report --no-children -i sgx-perf-1707196952.data

Same steps can be tried with MongoDB, MySQL etc..


This change is Reviewable

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ) (waiting on @jkr0103)

a discussion (no related file):
This PR looks surprisingly simple. Nice! Thanks for this.


a discussion (no related file):
Could you copy-paste some examples from your experiments? I'd like to see how things look like in reality (before testing myself locally).

Also, what applications would you recommend to test?



Documentation/performance.rst line 636 at r1 (raw file):

signal which would help in cases where user would like to collect profile data
only for a particular period (e.g., during benchmarks run). Send SIGUSR1 signal
using command ``kill -SIGUSR1 <pid>``.

You should also mention that sending multiple SIGUSR1 will create multiple files, each containing profile data collected after the previous SIGUSR1. And that this is helpful when the application runs in clearly separated stages.


pal/src/host/linux-sgx/host_exception.c line 191 at r1 (raw file):

}

static void handle_async_sigusr1_signal(int signum, siginfo_t* info, struct ucontext* uc) {

That's a too long name. SIGUSR1 is already a "signal" and already a "asynchronous" one.

So just rename to handle_sigusr1()


pal/src/host/linux-sgx/host_exception.c line 201 at r1 (raw file):

        sgx_profile_init();

        struct debug_map* map = g_debug_map;

Please add the same comment from here:

/* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */


pal/src/host/linux-sgx/host_profile.c line 127 at r1 (raw file):

    ret = DO_SYSCALL(clock_gettime, CLOCK_REALTIME, &ts);
    if (ret < 0) {
        log_error("sgx_profile_sample: clock_gettime failed: %s", unix_strerror(ret));

sgx_profile_init:


pal/src/host/linux-sgx/host_profile.c line 128 at r1 (raw file):

    if (ret < 0) {
        log_error("sgx_profile_sample: clock_gettime failed: %s", unix_strerror(ret));
        return false;

What is this? Should be goto out;


pal/src/host/linux-sgx/host_profile.c line 133 at r1 (raw file):

    if (g_pal_enclave.profile_append_pid_to_filename) {
        snprintf(g_pal_enclave.profile_filename, ARRAY_SIZE(g_pal_enclave.profile_filename),
                 SGX_PROFILE_FILENAME_WITH_PID, (int)g_host_pid, ts.tv_sec);

profile_filename is 64 chars in size. Now that we also add the number of seconds, are we sure we still fit in 64 chars limit? Please check how many chars in the filenames you see in your experiments.

I think we can increase to 128 chars in size, and this will be more than enough.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

    g_mem_fd = -1;

    log_always("Profile data written to %s (%lu bytes)", g_pal_enclave.profile_filename, size);

Why this change?

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 5 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Could you copy-paste some examples from your experiments? I'd like to see how things look like in reality (before testing myself locally).

Also, what applications would you recommend to test?

Done.



Documentation/performance.rst line 636 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You should also mention that sending multiple SIGUSR1 will create multiple files, each containing profile data collected after the previous SIGUSR1. And that this is helpful when the application runs in clearly separated stages.

Done.


pal/src/host/linux-sgx/host_exception.c line 191 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

That's a too long name. SIGUSR1 is already a "signal" and already a "asynchronous" one.

So just rename to handle_sigusr1()

Done.


pal/src/host/linux-sgx/host_exception.c line 201 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add the same comment from here:

/* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */

Done.


pal/src/host/linux-sgx/host_profile.c line 127 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

sgx_profile_init:

Done.


pal/src/host/linux-sgx/host_profile.c line 128 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What is this? Should be goto out;

Done.


pal/src/host/linux-sgx/host_profile.c line 133 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

profile_filename is 64 chars in size. Now that we also add the number of seconds, are we sure we still fit in 64 chars limit? Please check how many chars in the filenames you see in your experiments.

I think we can increase to 128 chars in size, and this will be more than enough.

Done.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why this change?

User might not want to enable debug logs due to performance reasons in that case user won't know the file where profile data is written. Also this log is seen only when profiling is enabled, in that case we should always inform user about the file name where profile data is written.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103)

a discussion (no related file):

Previously, jkr0103 (Jitender Kumar) wrote…

Done.

Done where? UPDATE: You updated the PR description, I see now.



pal/src/host/linux-sgx/host_exception.c line 201 at r2 (raw file):

        sgx_profile_init();

        /* Report all ELFs already loaded (host part of PAL and dynamic libraries used by it) */

Ah, sorry, the text in parenthesis is incorrect in this case -- it's not only the PAL binary and its dynamic-lib dependencies. It's also the application executable itself, the LibOS binary, etc.

So please shorten to just: /* Report all ELFs already loaded */


pal/src/host/linux-sgx/host_profile.c line 133 at r1 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

Done.

Btw, I looked at the sizes of the longest possible strings (sgx-perf- prefix + PID + timestamp + .data suffix). Typically you'll get ~32 chars. So 64 chars would also be enough.

Nevertheless, I don't see any harm in increasing from 64 chars to 128 chars.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

User might not want to enable debug logs due to performance reasons in that case user won't know the file where profile data is written. Also this log is seen only when profiling is enabled, in that case we should always inform user about the file name where profile data is written.

You're right, that's a good point.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 5 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/host/linux-sgx/host_exception.c line 201 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ah, sorry, the text in parenthesis is incorrect in this case -- it's not only the PAL binary and its dynamic-lib dependencies. It's also the application executable itself, the LibOS binary, etc.

So please shorten to just: /* Report all ELFs already loaded */

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

@dimakuv
Copy link
Contributor

dimakuv commented Feb 6, 2024

Jenkins, test this please

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, all discussions resolved, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file):
I tested the PR, looks like it's working. I followed the instructions in the PR description and tested Redis on my local machine.

I did 4 steps:

  1. Start Redis and perform kill -SIGUSR1 after 15 seconds; this is the enclave startup time.
  2. Wait for 15 seconds and perform kill -SIGUSR1 again; this is the do-nothing idle time.
  3. Start ./src/src/redis-benchmark -t set,get, wait till it finishes (around 15 seconds) and perform kill -SIGUSR1 again; this is the actual benchmark-working time.
  4. Wait for 15 seconds and finally terminate Redis via kill -SIGTERM; this is the do-nothing idle time.

So, a total of 1 minute.

sgx-perf-1707208751.data (178147 bytes) -- startup (15s)

Samples: 21
-   76.19%  pal-sgx  libpal.so      [.] copy_u64s
     copy_u64s                                   
     copy_u64s_from_untrusted                    
     sgx_copy_to_enclave_verified                
   - sgx_copy_to_enclave                         
      + 66.67% load_trusted_or_allowed_file      
      + 9.52% copy_and_verify_trusted_file       
-   19.05%  pal-sgx  libpal.so      [.] memset   
     _real_memset (inlined)                      
     _PalVirtualMemoryAlloc                      
     PalVirtualMemoryAlloc                       
     libos_syscall_mmap                          
+    4.76%  pal-sgx  libsysdb.so    [.] snprintf_callback

sgx-perf-1707208766.data (18547 bytes) -- idle before bench run (15s)

Samples: 2
-   50.00%  pal-sgx  libpal.so      [.] copy_u64s    
     copy_u64s                                       
     copy_u64s_from_untrusted                        
     sgx_copy_to_enclave_verified                    
     alloc_untrusted_futex_word                      
     _PalEventCreate                                 
     PalEventCreate                                  
     create_lock                                     
     get_new_handle                                  
     libos_syscall_openat                            
-   50.00%  pal-sgx  libpal.so      [.] slab_alloc   
     slab_alloc                                      
     malloc                                          
     calloc                                          
     _PalEventCreate                                 
     PalEventCreate                                  
     create_lock                                     
     get_new_handle                                  
     libos_syscall_openat                            

sgx-perf-1707208782.data (2059747 bytes) -- benchmark run (15s)

Samples: 245
+   38.37%  pal-sgx  libpal.so         [.] enclave_entry
+    4.08%  pal-sgx  libpal.so         [.] copy_u64s
+    2.45%  pal-sgx  libpal.so         [.] sgx_ocall
+    2.04%  pal-sgx  libpal.so         [.] __save_xregs
+    2.04%  pal-sgx  redis-server      [.] je_malloc_usable_size
+    1.63%  pal-sgx  libpal.so         [.] __restore_xregs
+    1.63%  pal-sgx  libpal.so         [.] sgx_copy_from_enclave_verified
+    1.63%  pal-sgx  libsysdb.so       [.] _return_from_syscall
+    1.63%  pal-sgx  redis-server      [.] siphash_nocase
+    1.22%  pal-sgx  libpal.so         [.] sgx_exitless_ocall
+    1.22%  pal-sgx  libpal.so         [.] sgx_reset_ustack
+    1.22%  pal-sgx  libsysdb.so       [.] cmp_addr_to_vma
+    1.22%  pal-sgx  libsysdb.so       [.] libos_syscall_gettimeofday
+    1.22%  pal-sgx  libsysdb.so       [.] truncate_file_name
+    0.82%  pal-sgx  libc.so.6         [.] __GI___libc_write
+    0.82%  pal-sgx  libc.so.6         [.] __strcasecmp_l_avx2
+    0.82%  pal-sgx  libc.so.6         [.] read
+    0.82%  pal-sgx  libpal.so         [.] PalSocketSend
+    0.82%  pal-sgx  libpal.so         [.] copy_u64s_to_untrusted
+    0.82%  pal-sgx  libpal.so         [.] memcpy
+    0.82%  pal-sgx  libpal.so         [.] ocall_send
+    0.82%  pal-sgx  libpal.so         [.] sgx_is_valid_untrusted_ptr
+    0.82%  pal-sgx  libsysdb.so       [.] avl_tree_lower_bound_fn

sgx-perf-1707208796.data (18547 bytes) -- idle after bench run (15s)

Samples: 2
-  100.00%  pal-sgx  libpal.so      [.] copy_u64s     
     copy_u64s                                        
     copy_u64s_from_untrusted                         
     sgx_copy_to_enclave_verified                     
     alloc_untrusted_futex_word                       
     _PalEventCreate                                  
     PalEventCreate                                   
     create_lock                                      
     get_new_handle                                   
     libos_syscall_openat 

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103)


pal/src/host/linux-sgx/host_exception.c line 198 at r3 (raw file):

#ifdef DEBUG
    if (g_pal_enclave.profile_enable) {
        sgx_profile_finish();

This PR breaks on multi-threaded apps. I tested on Memcached and got SIGSEGV.

That's because some other thread may be in the middle of saving something into the profile, and the profile data/metadata is "swept" from its feet.

We already have a global spinlock that protects all operations on profiling, so we just need to surround all this logic with a single lock/unlock. Let me prepare a fixup.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 6 files reviewed, 1 unresolved discussion, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners


pal/src/host/linux-sgx/host_exception.c line 198 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This PR breaks on multi-threaded apps. I tested on Memcached and got SIGSEGV.

That's because some other thread may be in the middle of saving something into the profile, and the profile data/metadata is "swept" from its feet.

We already have a global spinlock that protects all operations on profiling, so we just need to surround all this logic with a single lock/unlock. Let me prepare a fixup.

Done. The code is pretty ugly and will need some re-arrangements, and because we don't protect g_mem_fd with a lock, we sometimes see (rather benign) error messages. Someone (me?) will need to make the code look better. But functionality-wise seems to be ok.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r2, all commit messages.
Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)

a discussion (no related file):
Why is this implemented in PAL-SGX? Profiling is a generic feature in Gramine, not SGX-specific.



-- commits line 2 at r4:
missing component tag


Documentation/performance.rst line 639 at r4 (raw file):

multiple files, each containing profile data collected after the previous
``SIGUSR1`` which is helpful when the application runs in clearly separated
stages.

This has to be rewritten, it's hard to read.


pal/src/host/linux-sgx/host_exception.c line 198 at r4 (raw file):

#ifdef DEBUG
    if (g_pal_enclave.profile_enable) {
        sgx_profile_reinit();

You can't call this from a signal handler for like 10 different reasons, e.g. because it tries to take a lock which could have been already taken by the current thread when SIGUSR1 arrived, causing a deadlock.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @jkr0103, and @mkow)


pal/src/host/linux-sgx/host_exception.c line 198 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

You can't call this from a signal handler for like 10 different reasons, e.g. because it tries to take a lock which could have been already taken by the current thread when SIGUSR1 arrived, causing a deadlock.

Indeed. I think the best way would be to just set an atomic flag in the signal, and do the flush the next time profile data is collected (eg. in sgx_profile_sample_aex), which should happen often enough.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Why is this implemented in PAL-SGX? Profiling is a generic feature in Gramine, not SGX-specific.

There is nothing required for non-SGX PAL (gramine-direct). This kind of profiling just works, it doesn't require any additional features.

So we're augmenting already-existing SGX PAL support for profiling. See https://gramine.readthedocs.io/en/stable/performance.html#profiling-with-perf



pal/src/host/linux-sgx/host_exception.c line 198 at r4 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

Indeed. I think the best way would be to just set an atomic flag in the signal, and do the flush the next time profile data is collected (eg. in sgx_profile_sample_aex), which should happen often enough.

Yeah, I agree...

The code got super-ugly now, @jkr0103 or me will need to sort this all out.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 6 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_exception.c line 198 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yeah, I agree...

The code got super-ugly now, @jkr0103 or me will need to sort this all out.

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


-- commits line 2 at r4:

Previously, mkow (Michał Kowalczyk) wrote…

missing component tag

Previously we had a prefix like this: [Pal/Linux-SGX] SGX perf:

So I suggest to change the commit message to

[Pal/Linux-SGX] SGX perf: flush profile data on SIGUSR1 signal

This allows to collect profile data only for a particular period
(e.g., during Gramine startup or during an actual benchmark run).

Documentation/performance.rst line 639 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This has to be rewritten, it's hard to read.

Proposed text:

It is also possible to flush the collected profile data interactively, using the
``SIGUSR1`` signal. This helps to collect profile data only for a particular
period, e.g., skipping the Gramine startup and application initialization time
and concentrating only on the actual application processing. Send ``SIGUSR1``
using command ``kill -SIGUSR1 <gramine-pid>``. Sending multiple ``SIGUSR1`` will
create multiple files, each containing profile data collected after the previous
``SIGUSR1``.

pal/src/host/linux-sgx/host_exception.c line 198 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Done. The code is pretty ugly and will need some re-arrangements, and because we don't protect g_mem_fd with a lock, we sometimes see (rather benign) error messages. Someone (me?) will need to make the code look better. But functionality-wise seems to be ok.

This was fixed by @jkr0103


pal/src/host/linux-sgx/host_internal.h line 134 at r5 (raw file):

#ifdef DEBUG
#include "elf/elf.h"

What is this? Why is this needed in this header?


pal/src/host/linux-sgx/host_internal.h line 155 at r5 (raw file):

/* Finalize and close file */
void sgx_profile_finish(void);
void sgx_profile_finish_thread_unsafe(void);

This function must be declared as static in the *.c file, and removed from here.


pal/src/host/linux-sgx/host_internal.h line 158 at r5 (raw file):

/* Re-initialize based on g_pal_enclave settings */
int sgx_profile_reinit(void);

This function must be declared as static in the *.c file, and removed from here.


pal/src/host/linux-sgx/host_internal.h line 176 at r5 (raw file):

/* Record a new mapped ELF */
void sgx_profile_report_elf(const char* filename, void* addr);
void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr);

Why do you need to declare these 4 functions? You should just declare those functions as static in the *.c file, and remove from here.


pal/src/host/linux-sgx/host_profile.c line 177 at r5 (raw file):

}

void sgx_profile_finish_thread_unsafe(void) {

static void


pal/src/host/linux-sgx/host_profile.c line 197 at r5 (raw file):

    g_profile_enabled = false;
}

Please add a newline after this


pal/src/host/linux-sgx/host_profile.c line 203 at r5 (raw file):

    spinlock_unlock(&g_perf_data_lock);
}

Please remove a newline


pal/src/host/linux-sgx/host_profile.c line 454 at r5 (raw file):

            log_error("sgx_profile_report_elf(%s): close failed: %s", filename, unix_strerror(ret));
    }
}

Please add a newline after this


pal/src/host/linux-sgx/host_profile.c line 481 at r5 (raw file):

        sgx_profile_report_elf_cleanup(filename, &elf_addr, elf_length, fd);
    }
}

Sorry, this refactoring of sgx_profile_report_elf() turns out to be a complete mess...

I suggest to do the same you did with sgx_profile_finish() -- just remove the locks from the original function and rename it to _thread_unsafe(), and in the wrapper function add the locks. Yes, this will be less performant than the original code (because the critical section in-between lock and unlock becomes larger), but this function is called very rarely, only on new binaries/libraries loaded into the application.


pal/src/host/linux-sgx/host_profile.c line 484 at r5 (raw file):

int sgx_profile_reinit(void) {
    int ret;

Please add in the very beginning: assert(spinlock_is_locked(&g_perf_data_lock));


pal/src/host/linux-sgx/host_profile.c line 489 at r5 (raw file):

    ret = sgx_profile_init();
    if (ret < 0) {
        log_error("sgx_profile_reinit: initializing again failed: %s", unix_strerror(ret));

Actually no need for "initializing again", as the name of the func already explains the issue. So: sgx_profile_reinit failed: %s

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 6 files reviewed, 16 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


-- commits line 2 at r4:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Previously we had a prefix like this: [Pal/Linux-SGX] SGX perf:

So I suggest to change the commit message to

[Pal/Linux-SGX] SGX perf: flush profile data on SIGUSR1 signal

This allows to collect profile data only for a particular period
(e.g., during Gramine startup or during an actual benchmark run).

Could you please add it while rebasing the PR?


Documentation/performance.rst line 639 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Proposed text:

It is also possible to flush the collected profile data interactively, using the
``SIGUSR1`` signal. This helps to collect profile data only for a particular
period, e.g., skipping the Gramine startup and application initialization time
and concentrating only on the actual application processing. Send ``SIGUSR1``
using command ``kill -SIGUSR1 <gramine-pid>``. Sending multiple ``SIGUSR1`` will
create multiple files, each containing profile data collected after the previous
``SIGUSR1``.

Done.


pal/src/host/linux-sgx/host_exception.c line 198 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This was fixed by @jkr0103

Done.


pal/src/host/linux-sgx/host_internal.h line 134 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What is this? Why is this needed in this header?

It was required for elf_ehdr_t but not required anymore as those function declarations are moved out of this file.


pal/src/host/linux-sgx/host_internal.h line 155 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This function must be declared as static in the *.c file, and removed from here.

Done.


pal/src/host/linux-sgx/host_internal.h line 158 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This function must be declared as static in the *.c file, and removed from here.

Done.


pal/src/host/linux-sgx/host_internal.h line 176 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why do you need to declare these 4 functions? You should just declare those functions as static in the *.c file, and remove from here.

These functions are not required now, hence deleted.


pal/src/host/linux-sgx/host_profile.c line 177 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

static void

Done.


pal/src/host/linux-sgx/host_profile.c line 197 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add a newline after this

Done.


pal/src/host/linux-sgx/host_profile.c line 203 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please remove a newline

Done.


pal/src/host/linux-sgx/host_profile.c line 454 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add a newline after this

Done.


pal/src/host/linux-sgx/host_profile.c line 481 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sorry, this refactoring of sgx_profile_report_elf() turns out to be a complete mess...

I suggest to do the same you did with sgx_profile_finish() -- just remove the locks from the original function and rename it to _thread_unsafe(), and in the wrapper function add the locks. Yes, this will be less performant than the original code (because the critical section in-between lock and unlock becomes larger), but this function is called very rarely, only on new binaries/libraries loaded into the application.

Done.


pal/src/host/linux-sgx/host_profile.c line 484 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add in the very beginning: assert(spinlock_is_locked(&g_perf_data_lock));

Done.


pal/src/host/linux-sgx/host_profile.c line 489 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually no need for "initializing again", as the name of the func already explains the issue. So: sgx_profile_reinit failed: %s

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @jkr0103, and @mkow)


-- commits line 2 at r4:

Previously, jkr0103 (Jitender Kumar) wrote…

Could you please add it while rebasing the PR?

Yes I will


pal/src/host/linux-sgx/host_internal.h line 165 at r6 (raw file):

/* Record a sample during OCALL (function to be executed) */
void sgx_profile_sample_ocall_outer(void* ocall_func);

Please re-add the accidentally removed line


pal/src/host/linux-sgx/host_profile.c line 206 at r6 (raw file):

static void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr) {
    int ret;

Can you add assert(spinlock_is_locked(&g_perf_data_lock)); as the very first line?


pal/src/host/linux-sgx/host_profile.c line 324 at r6 (raw file):

    if (ret < 0) {
        log_error("sgx_profile_reinit failed: %s", unix_strerror(ret));
        goto out;

There seems to be no reason to have out: label. Just return ret here, and return 0 at the end of the func.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 6 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_internal.h line 165 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please re-add the accidentally removed line

Done.


pal/src/host/linux-sgx/host_profile.c line 206 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you add assert(spinlock_is_locked(&g_perf_data_lock)); as the very first line?

This function is file-scope only and callers grabed or assert on lock, hence adding this assert seems redundant.


pal/src/host/linux-sgx/host_profile.c line 324 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

There seems to be no reason to have out: label. Just return ret here, and return 0 at the end of the func.

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_profile.c line 206 at r6 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

This function is file-scope only and callers grabed or assert on lock, hence adding this assert seems redundant.

LGTM, I'm ok with this explanation. (Indeed, very easy to verify that the lock is taken.)


pal/src/host/linux-sgx/host_profile.c line 57 at r7 (raw file):

    while (total < size) {
        ret = DO_SYSCALL(pread64, g_mem_fd, (uint8_t*)dest + total, size - total,

Technically accessing g_mem_fd must be under the g_perf_data_lock spinlock, because this variable can become -1 during re-opening of the file.

On the other hand, adding locking here would lead to tremendous overheads of profiling, which will seriously distort the perf results. So I'm very reluctant to do proper thread safety here...

In the worst case, pread64() will return smth like EBADF, and the sample will be lost. That's what I observe in practice when testing this PR. Sounds like a small correctness price to pay, so I would leave as is.

@dimakuv dimakuv requested a review from mkow March 8, 2024 11:34
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r8, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

a discussion (no related file):
Bar the (non-blocking) discussion point, I am fine with this PR.

I would like to test it again after all reviews are done and before merge.


Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_profile.c line 57 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Technically accessing g_mem_fd must be under the g_perf_data_lock spinlock, because this variable can become -1 during re-opening of the file.

On the other hand, adding locking here would lead to tremendous overheads of profiling, which will seriously distort the perf results. So I'm very reluctant to do proper thread safety here...

In the worst case, pread64() will return smth like EBADF, and the sample will be lost. That's what I observe in practice when testing this PR. Sounds like a small correctness price to pay, so I would leave as is.

I thought about this a bit more, and I have the following compromise solution:

  1. We introduce a global g_null_mem_fd which is opened at init time and points to the pseudo-file "/dev/null" (i.e., all writes to this FD are no-ops).
  2. Instead of closing previous g_mem_fd, we dup2() the g_null_mem_fd into the previous mem FD. In other words, all writes to the previous mem FD do not return with error, but instead are silently dropped. Note that dup2() is an atomic operation, which is what we want.
  3. So all threads that are still using the previous mem FD will not fail but instead just do nothing. At some point these threads will notice the new mem FD and will start writing their samples into this new mem FD.

This also solves the problem that I didn't realize when I wrote my previous comment (and that problem is very serious):

  • One thread closes g_mem_fd, now the FD number (say 10) that was used for mem FD is available to any other usage.
  • Another thread still wants to use this old mem FD (i.e. pread(10)), and is preempted.
  • Yet another thread wants to open some unrelated file. It notices that FD 10 is available and assigns 10 to this new file.
  • Second thread is scheduled again. It now performs pread() from a completely irrelevant file. Gramine state becomes silently corrupted. Oops.

There is a problem that we will leak g_mem_fd and g_perf_data::pd on every SIGUSR1 signal, but it's a small price to pay. (I think the only alternative would be to use spinlocks everywhere, which is very expensive.)


All the above also applies to g_perf_data::pd FD.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 7 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_profile.c line 57 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I thought about this a bit more, and I have the following compromise solution:

  1. We introduce a global g_null_mem_fd which is opened at init time and points to the pseudo-file "/dev/null" (i.e., all writes to this FD are no-ops).
  2. Instead of closing previous g_mem_fd, we dup2() the g_null_mem_fd into the previous mem FD. In other words, all writes to the previous mem FD do not return with error, but instead are silently dropped. Note that dup2() is an atomic operation, which is what we want.
  3. So all threads that are still using the previous mem FD will not fail but instead just do nothing. At some point these threads will notice the new mem FD and will start writing their samples into this new mem FD.

This also solves the problem that I didn't realize when I wrote my previous comment (and that problem is very serious):

  • One thread closes g_mem_fd, now the FD number (say 10) that was used for mem FD is available to any other usage.
  • Another thread still wants to use this old mem FD (i.e. pread(10)), and is preempted.
  • Yet another thread wants to open some unrelated file. It notices that FD 10 is available and assigns 10 to this new file.
  • Second thread is scheduled again. It now performs pread() from a completely irrelevant file. Gramine state becomes silently corrupted. Oops.

There is a problem that we will leak g_mem_fd and g_perf_data::pd on every SIGUSR1 signal, but it's a small price to pay. (I think the only alternative would be to use spinlocks everywhere, which is very expensive.)


All the above also applies to g_perf_data::pd FD.

g_mem_fd need not be closed/open again and again on SIGUSR1, we can keep it open for the lifetime of the process. @dimakuv do you see any issue with this approach?

I have implemented the solution (using dup2) suggestion above for g_perf_data::pd.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_perf_data.c line 155 at r9 (raw file):

    }

    // File `/dev/null` is open once and remain open during the lifecycle of the process

I would rephrase to this:

/* `/dev/null` is opened once and is kept open for the whole runtime of Gramine */

pal/src/host/linux-sgx/host_perf_data.c line 158 at r9 (raw file):

    if (g_dev_null_fd < 0) {
        ret = DO_SYSCALL(open, "/dev/null",
                            O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);

Please align O_WRONLY ... line to be exactly under open, ...


pal/src/host/linux-sgx/host_perf_data.c line 158 at r9 (raw file):

    if (g_dev_null_fd < 0) {
        ret = DO_SYSCALL(open, "/dev/null",
                            O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);

Please remove O_TRUNC and O_CREAT -- they don't make sense in this context


pal/src/host/linux-sgx/host_perf_data.c line 160 at r9 (raw file):

                            O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);
        if (ret < 0) {
            log_error("pd_open: cannot open `/dev/null` for writing: %s", unix_strerror(fd));

Not fd but ret


pal/src/host/linux-sgx/host_perf_data.c line 298 at r9 (raw file):

out:
    if ((dup2_ret = dup2(g_dev_null_fd, pd->fd)) < 0)

We don't use Glibc wrappers here (dup2()) but instead raw syscalls (DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)). Please fix.


pal/src/host/linux-sgx/host_profile.c line 57 at r7 (raw file):

g_mem_fd need not be closed/open again and again on SIGUSR1, we can keep it open for the lifetime of the process. @dimakuv do you see any issue with this approach?

Yes, you are right. This is good how you implemented it.

I have implemented the solution (using dup2) suggestion above for g_perf_data::pd.

Yes, looks like what I wanted.


pal/src/host/linux-sgx/host_profile.c line 116 at r9 (raw file):

    g_profile_mode = g_pal_enclave.profile_mode;

    // File `/proc/self/mem` is open once and remain open during the lifecycle of the process

I would rephrase to this:

/* `/proc/self/mem` is opened once and is kept open for the whole runtime of Gramine */

pal/src/host/linux-sgx/host_profile.c line 121 at r9 (raw file):

        if (ret < 0) {
            log_error("sgx_profile_init: opening /proc/self/mem failed: %s", unix_strerror(ret));
            return ret;

Why did you replace all goto out with return ret? This looked reasonable before. Now we have sometimes return ret and sometimes goto out, which is more confusing.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 7 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_perf_data.c line 155 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I would rephrase to this:

/* `/dev/null` is opened once and is kept open for the whole runtime of Gramine */

Done.


pal/src/host/linux-sgx/host_perf_data.c line 158 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please align O_WRONLY ... line to be exactly under open, ...

Done.


pal/src/host/linux-sgx/host_perf_data.c line 158 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please remove O_TRUNC and O_CREAT -- they don't make sense in this context

Done.


pal/src/host/linux-sgx/host_perf_data.c line 160 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Not fd but ret

Done.


pal/src/host/linux-sgx/host_perf_data.c line 298 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We don't use Glibc wrappers here (dup2()) but instead raw syscalls (DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)). Please fix.

Done.


pal/src/host/linux-sgx/host_profile.c line 116 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I would rephrase to this:

/* `/proc/self/mem` is opened once and is kept open for the whole runtime of Gramine */

Done.


pal/src/host/linux-sgx/host_profile.c line 121 at r9 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why did you replace all goto out with return ret? This looked reasonable before. Now we have sometimes return ret and sometimes goto out, which is more confusing.

reverted now.

dimakuv
dimakuv previously approved these changes Apr 15, 2024
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r5, 1 of 4 files at r8, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


-- commits line 2 at r4:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yes I will

[Pal/Linux-SGX] SGX perf -> [Pal/Linux-SGX] perf


Documentation/performance.rst line 638 at r10 (raw file):

initialization time and concentrating only on the actual application processing.
Send ``SIGUSR1`` using command ``kill -SIGUSR1 -<gramine-pid>``
(note the minus sign before <gramine-pid>). Sending multiple ``SIGUSR1`` will

Do we want this? Why?
Also, then it's not gramine-pid, but PGID.

Code quote:

(note the minus sign before <gramine-pid>).

pal/src/host/linux-sgx/host_profile.c line 164 at r10 (raw file):

}

static void sgx_profile_finish_thread_unsafe(void) {

What does "unsafe" mean in this context?


pal/src/host/linux-sgx/host_profile.c line 186 at r10 (raw file):

}

static void sgx_profile_report_elf_thread_unsafe(const char* filename, void* addr) {

Could you move this function down where it was before? It obscures the diff and it's hard to review. If you want to move functions around, I'd prefer if you did that in a follow-up PR where you only change the order of them.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 7 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


-- commits line 2 at r4:

Previously, mkow (Michał Kowalczyk) wrote…

[Pal/Linux-SGX] SGX perf -> [Pal/Linux-SGX] perf

will be done while rebasing.


Documentation/performance.rst line 638 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Do we want this? Why?
Also, then it's not gramine-pid, but PGID.

Some workloads create multiple child process e.g. nginx have option worker_processes 4; which means it will create 4 child process. Minus sign is required to send the signal to all the processes in the group.


pal/src/host/linux-sgx/host_profile.c line 164 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

What does "unsafe" mean in this context?

unsafe means it's not safe for multiple thread, hence require locking,. User of this function should take care of locking before using this function if expecting multiple threads to enter at the same time.


pal/src/host/linux-sgx/host_profile.c line 186 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Could you move this function down where it was before? It obscures the diff and it's hard to review. If you want to move functions around, I'd prefer if you did that in a follow-up PR where you only change the order of them.

That required forward declaration. Anyways moved it down and added forward declaration.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r5, 1 of 4 files at r8, 1 of 2 files at r10, 1 of 2 files at r11, all commit messages.
Reviewable status: 6 of 7 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


pal/src/host/linux-sgx/host_exception.c line 196 at r11 (raw file):

    __UNUSED(uc);

#ifdef DEBUG

Please move it up to #ifdef the whole function.


pal/src/host/linux-sgx/host_exception.c line 241 at r11 (raw file):

        goto err;

    ret = set_signal_handler(SIGUSR1, handle_sigusr1);

Please also wrap this is #ifdef DEBUG.


pal/src/host/linux-sgx/host_perf_data.c line 297 at r11 (raw file):

out:
    if ((dup2_ret = DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)) < 0)

Please keep it in a separate line, like other places in our code do.


pal/src/host/linux-sgx/host_perf_data.c line 297 at r11 (raw file):

out:
    if ((dup2_ret = DO_SYSCALL(dup2, g_dev_null_fd, pd->fd)) < 0)

What is the exact race this change tries to prevent? And how could it possibly work if there's no synchronization point between closing/swapping this descriptor and freeing the memory holding pd?

It seems to me that if close(...) was racy before, then free(pd) still is.


pal/src/host/linux-sgx/host_perf_data.c line 300 at r11 (raw file):

        log_error("pd_close: dup2 failed: %s", unix_strerror(dup2_ret));

    free(pd);

pd->fd leaks here


pal/src/host/linux-sgx/host_profile.c line 164 at r10 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

unsafe means it's not safe for multiple thread, hence require locking,. User of this function should take care of locking before using this function if expecting multiple threads to enter at the same time.

Ah, ok, but no other function in this file (nor any other in Gramine) uses such naming convention. I thought it's "finish thread" with an "unsafe" suffix, not "thread unsafe" suffix.
If you want to require locking by the caller then just assert the relevant lock state.

Update:
Hmm, I guess you used this name because there exists also sgx_profile_finish with the locking, which is public. We prepend a single underscore for such cases + assert, see e.g.:

static void _maybe_dequeue_futex(struct libos_futex* futex) {
assert(spinlock_is_locked(&futex->lock));
assert(spinlock_is_locked(&g_futex_tree_lock));
if (check_dequeue_futex(futex)) {
avl_tree_delete(&g_futex_tree, &futex->tree_node);
futex->in_tree = false;
/* We still hold this futex reference (in the caller), so this won't call free. */
put_futex(futex);
}
}
/*
* If `futex` has no waiters and is on `g_futex_tree`, takes it off that tree.
*
* Neither `g_futex_tree_lock` nor `futex->lock` should be held while calling this,
* it acquires these locks itself.
*/
static void maybe_dequeue_futex(struct libos_futex* futex) {
spinlock_lock(&g_futex_tree_lock);
spinlock_lock(&futex->lock);
_maybe_dequeue_futex(futex);
spinlock_unlock(&futex->lock);
spinlock_unlock(&g_futex_tree_lock);
}


pal/src/host/linux-sgx/host_profile.c line 40 at r11 (raw file):

_WITH_PID

This suffix only made sense before, when there were more possible names. And it's even misleading now, it mentions the PID but not the time.


pal/src/host/linux-sgx/host_profile.c line 178 at r11 (raw file):

    log_always("Profile data written to %s (%lu bytes)", g_pal_enclave.profile_filename, size);

    g_profile_enabled = false;

This variable is read in many places without a lock because it was assumed to be constant after initialization, you cannot change it here.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 7 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


-- commits line 2 at r4:

Previously, jkr0103 (Jitender Kumar) wrote…

will be done while rebasing.

(and also, Pal -> PAL)

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 7 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

At the same time, the low-level code in host_perf_data.c operates on a pointer to struct perf_data*, so gives the (unused) flexibility to operate on several objects simultaneously. Technically, host_perf_data.c could just always operate on global g_perf_data, but the current way the code is written is nicer and more extensible.

I think the reason for this being a pointer is to allocate this struct only when it's actually used - because the buffer inside it takes 32MB.

For ease of review and merge, we can split the proposed solution into several PRs:

I think we could keep it as a single PR? It's just 100 lines diff now, and probably won't be much more with this proposal? But I don't care about that too much, we can as well split.

Otherwise I like the proposal.

Ok, then let's keep as one PR.

@jkr0103 Will you do this new approach?


Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 7 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ok, then let's keep as one PR.

@jkr0103 Will you do this new approach?

done



pal/src/host/linux-sgx/host_profile.c line 178 at r11 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

It's still a race condition if you write with locks, but read without.

Reverted the changes here based on new approach discussed above.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r14, all commit messages.
Reviewable status: all files reviewed, 23 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_internal.h line 59 at r14 (raw file):

extern struct pal_enclave g_pal_enclave;
extern bool trigger_profile_reinit;

Please add g_ prefix -- all global (non-function-scope) variables must have this prefix.


pal/src/host/linux-sgx/host_internal.h line 60 at r14 (raw file):

extern struct pal_enclave g_pal_enclave;
extern bool trigger_profile_reinit;
extern char profile_filename[128];

ditto (prefix)


pal/src/host/linux-sgx/host_internal.h line 177 at r14 (raw file):

struct perf_data* pd_open(const char* file_name, bool with_stack);

ssize_t pd_close_file(struct perf_data* pd);

Can you group the new helper functions together? Like this:

int pd_open_file(struct perf_data* pd, const char* file_name);
ssize_t pd_close_file(struct perf_data* pd);

struct perf_data* pd_open(const char* file_name, bool with_stack);
...

pal/src/host/linux-sgx/host_main.c line 780 at r14 (raw file):

#ifdef DEBUG
    enclave_info->profile_enable = false;
    profile_filename[0] = '\0';

I think you can safely remove this line, it's not needed. (The global variable that you declare is already initialized to all zeros by the compiler.)


pal/src/host/linux-sgx/host_perf_data.c line 144 at r14 (raw file):

int pd_open_file(struct perf_data* pd, const char* file_name) {
    int ret, close_ret;

Please move close_ret to where it is used:

fail:
    int close_ret = DO_SYSCALL(close, fd);
    ...

It's just easier to read that way -- there's only one place where it is used, so it doesn't have to be declared at the top of the func.


pal/src/host/linux-sgx/host_perf_data.c line 148 at r14 (raw file):

    int fd = DO_SYSCALL(open, file_name, O_WRONLY | O_TRUNC | O_CREAT | O_CLOEXEC, PERM_rw_r__r__);
    if (fd < 0) {
        log_error("pd_open: cannot open %s for writing: %s", file_name, unix_strerror(fd));

pd_open -> pd_open_file


pal/src/host/linux-sgx/host_perf_data.c line 149 at r14 (raw file):

    if (fd < 0) {
        log_error("pd_open: cannot open %s for writing: %s", file_name, unix_strerror(fd));
        return -EINVAL;

You can just return fd;


pal/src/host/linux-sgx/host_perf_data.c line 171 at r14 (raw file):

    close_ret = DO_SYSCALL(close, fd);
    if (close_ret < 0)
        log_error("pd_reopen: close failed: %s", unix_strerror(close_ret));

pd_reopen -> pd_open_file


pal/src/host/linux-sgx/host_perf_data.c line 176 at r14 (raw file):

struct perf_data* pd_open(const char* file_name, bool with_stack) {
    int ret;

Please move ret directly to where it's used:

    int ret = pd_open_file(pd, file_name);
    ...

pal/src/host/linux-sgx/host_perf_data.c line 182 at r14 (raw file):

        log_error("pd_open: out of memory");
        return NULL;
    }

Please add an empty line after this one


pal/src/host/linux-sgx/host_perf_data.c line 185 at r14 (raw file):

    ret = pd_open_file(pd, file_name);
    if (ret < 0) {
        log_error("pd_open: close failed: %s", unix_strerror(ret));

The error message is wrong. I think you can just remove this line, because pd_open_file() already prints an error message.


pal/src/host/linux-sgx/host_perf_data.c line 186 at r14 (raw file):

    if (ret < 0) {
        log_error("pd_open: close failed: %s", unix_strerror(ret));
        return NULL;

You forgot to free(pd); before returning


pal/src/host/linux-sgx/host_perf_data.c line 282 at r14 (raw file):

ssize_t pd_close_file(struct perf_data* pd) {
    ssize_t ret = 0;

Please remove = 0. Because this variable will be re-assigned anyway in the next line. I don't know why it was here in the first place.


pal/src/host/linux-sgx/host_perf_data.c line 283 at r14 (raw file):

ssize_t pd_close_file(struct perf_data* pd) {
    ssize_t ret = 0;
    int close_ret;

ditto (move to where it's used)


pal/src/host/linux-sgx/host_perf_data.c line 300 at r14 (raw file):

    close_ret = DO_SYSCALL(close, pd->fd);
    if (close_ret < 0)
        log_error("pd_close: close failed: %s", unix_strerror(close_ret));

pf_close_file:


pal/src/host/linux-sgx/host_perf_data.c line 302 at r14 (raw file):

        log_error("pd_close: close failed: %s", unix_strerror(close_ret));

    return ret;

It is a non-trivial property of this function that it returns "size of the finalized perf-data file" (the return value of lseek() syscall above).

Please add a comment:

/* returns the size of the finalized perf-data file on success */
return ret;

pal/src/host/linux-sgx/host_profile.c line 201 at r14 (raw file):

}

/* Re-initialize based on g_pal_enclave settings */

Actually, I don't understand this comment. I think the function name is self-explanatory, and this comment can be deleted.


pal/src/host/linux-sgx/host_profile.c line 210 at r14 (raw file):

    size = pd_close_file(g_perf_data);
    if (size < 0) {
        log_error("sgx_profile_reinit: pd_close failed: %s", unix_strerror(size));

pd_close_file failed


pal/src/host/linux-sgx/host_profile.c line 211 at r14 (raw file):

    if (size < 0) {
        log_error("sgx_profile_reinit: pd_close failed: %s", unix_strerror(size));
        return -size;

Why negation? Should be just return size;

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 7 files reviewed, 23 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_internal.h line 59 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add g_ prefix -- all global (non-function-scope) variables must have this prefix.

Done.


pal/src/host/linux-sgx/host_internal.h line 60 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (prefix)

Done.


pal/src/host/linux-sgx/host_internal.h line 177 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you group the new helper functions together? Like this:

int pd_open_file(struct perf_data* pd, const char* file_name);
ssize_t pd_close_file(struct perf_data* pd);

struct perf_data* pd_open(const char* file_name, bool with_stack);
...

Done.


pal/src/host/linux-sgx/host_main.c line 780 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I think you can safely remove this line, it's not needed. (The global variable that you declare is already initialized to all zeros by the compiler.)

Done.


pal/src/host/linux-sgx/host_perf_data.c line 148 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

pd_open -> pd_open_file

Done.


pal/src/host/linux-sgx/host_perf_data.c line 149 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You can just return fd;

Done.


pal/src/host/linux-sgx/host_perf_data.c line 171 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

pd_reopen -> pd_open_file

Done.


pal/src/host/linux-sgx/host_perf_data.c line 176 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please move ret directly to where it's used:

    int ret = pd_open_file(pd, file_name);
    ...

Done.


pal/src/host/linux-sgx/host_perf_data.c line 182 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please add an empty line after this one

Done.


pal/src/host/linux-sgx/host_perf_data.c line 185 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

The error message is wrong. I think you can just remove this line, because pd_open_file() already prints an error message.

Done.


pal/src/host/linux-sgx/host_perf_data.c line 186 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You forgot to free(pd); before returning

Done.


pal/src/host/linux-sgx/host_perf_data.c line 282 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please remove = 0. Because this variable will be re-assigned anyway in the next line. I don't know why it was here in the first place.

Done.


pal/src/host/linux-sgx/host_perf_data.c line 283 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (move to where it's used)

Done.


pal/src/host/linux-sgx/host_perf_data.c line 300 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

pf_close_file:

Done.


pal/src/host/linux-sgx/host_perf_data.c line 302 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

It is a non-trivial property of this function that it returns "size of the finalized perf-data file" (the return value of lseek() syscall above).

Please add a comment:

/* returns the size of the finalized perf-data file on success */
return ret;

Done.


pal/src/host/linux-sgx/host_profile.c line 201 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, I don't understand this comment. I think the function name is self-explanatory, and this comment can be deleted.

Done.


pal/src/host/linux-sgx/host_profile.c line 210 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

pd_close_file failed

Done.


pal/src/host/linux-sgx/host_profile.c line 211 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why negation? Should be just return size;

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r15, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_internal.h line 177 at r14 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

Done.

Sorry, that's not what I suggested.

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 7 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_internal.h line 177 at r14 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sorry, that's not what I suggested.

Done.

@dimakuv dimakuv requested a review from mkow June 11, 2024 08:43
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

a discussion (no related file):
I will test this PR when other reviewers also approve and before the merge. Otherwise looks good to me.


Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r15, 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


pal/src/host/linux-sgx/host_internal.h line 59 at r16 (raw file):

extern struct pal_enclave g_pal_enclave;
extern bool g_trigger_profile_reinit;

These two should be in #ifdef DEBUG.


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

ssize_t pd_close_file(struct perf_data* pd);

struct perf_data* pd_open(const char* file_name, bool with_stack);

I'd list this function first, the order would be more logical then.


pal/src/host/linux-sgx/host_perf_data.c line 297 at r11 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

Dmitrii and I have seen with memcached worklaod where on thread was trying to write but FD was already close, that's why we introduced dup2 which does two operations atomically close and swap the value of old to new FD.

You are right on free, hence removed free.

This hack was deleted, resolving.


pal/src/host/linux-sgx/host_perf_data.c line 300 at r11 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

We repalced close with dup2 to workaround the race condition. Please see above comments for more info on race condition.

This hack was deleted, resolving.


pal/src/host/linux-sgx/host_perf_data.c line 165 at r16 (raw file):

        goto fail;

    pd->fd = fd;

assert(pd->fd == -1) before overwriting this


pal/src/host/linux-sgx/host_perf_data.c line 168 at r16 (raw file):

    return 0;

fail: ;

Suggestion:

fail:;

pal/src/host/linux-sgx/host_perf_data.c line 299 at r16 (raw file):

    if (close_ret < 0)
        log_error("pd_close_file: close failed: %s", unix_strerror(close_ret));

Just to be sure, please set pd->fd to -1.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You're right, that's a good point.

Then please make it log_error or log_warning. log_always prints even if the user set loader.log_level = "none".


pal/src/host/linux-sgx/host_profile.c line 39 at r16 (raw file):

#define NSEC_IN_SEC 1000000000

/* Filenames for saved data */

Suggestion:

/* Filename for saved data */

pal/src/host/linux-sgx/host_profile.c line 156 at r16 (raw file):

out:
    if (g_mem_fd > 0) {

Suggestion:

if (g_mem_fd >= 0) {

pal/src/host/linux-sgx/host_profile.c line 213 at r16 (raw file):

    }

    log_always("Profile data written to %s (%lu bytes)", g_profile_filename, size);

ditto


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

    spinlock_lock(&g_perf_data_lock);
    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true)
        sgx_profile_reinit();

missing result check


pal/src/host/linux-sgx/host_profile.c line 272 at r16 (raw file):

    spinlock_lock(&g_perf_data_lock);
    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true)
        sgx_profile_reinit();

missing result check

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I'd list this function first, the order would be more logical then.

It was me who wanted the order differently, but Ok, both Michal and Jitender think it's better to reorder. I am Ok with this.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Then please make it log_error or log_warning. log_always prints even if the user set loader.log_level = "none".

One issue with log_error and log_warning is that they will prepend error: or warning: words to the string. It's a bit misleading, as there is nothing wrong with the execution. It's more like info: , but we currently don't have this convenience function. Maybe we should add it.

Anyway, I think we can go with log_error() for now, and then add log_info() in another PR.


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

missing result check

Yeah, since we're in the exception handler and there's no reason to terminate the whole application, I think we can just print log_error(), see several lines below for how we treat errors in pd_event_sample_simple().

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

One issue with log_error and log_warning is that they will prepend error: or warning: words to the string. It's a bit misleading, as there is nothing wrong with the execution. It's more like info: , but we currently don't have this convenience function. Maybe we should add it.

Anyway, I think we can go with log_error() for now, and then add log_info() in another PR.

This log is printed in non-production case only and that too when a file is being create and this info must be passed to the user. We still can have it even if user set loader.log_level = "none".

If you still disagree, I'll make the changes.


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yeah, since we're in the exception handler and there's no reason to terminate the whole application, I think we can just print log_error(), see several lines below for how we treat errors in pd_event_sample_simple().

we already print error message inside sgx_profile_reinit in all errors cases, do we really want to print error message again?

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 7 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_internal.h line 59 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

These two should be in #ifdef DEBUG.

Done.


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

It was me who wanted the order differently, but Ok, both Michal and Jitender think it's better to reorder. I am Ok with this.

Done.


pal/src/host/linux-sgx/host_perf_data.c line 165 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

assert(pd->fd == -1) before overwriting this

Done.


pal/src/host/linux-sgx/host_perf_data.c line 168 at r16 (raw file):

    return 0;

fail: ;

Done.


pal/src/host/linux-sgx/host_perf_data.c line 299 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Just to be sure, please set pd->fd to -1.

Done.


pal/src/host/linux-sgx/host_profile.c line 39 at r16 (raw file):

#define NSEC_IN_SEC 1000000000

/* Filenames for saved data */

Done.


pal/src/host/linux-sgx/host_profile.c line 156 at r16 (raw file):

out:
    if (g_mem_fd > 0) {

Done.


pal/src/host/linux-sgx/host_profile.c line 213 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

ditto


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

we already print error message inside sgx_profile_reinit in all errors cases, do we really want to print error message again?

Added a comment before calling sgx_profile_reinit.


pal/src/host/linux-sgx/host_profile.c line 272 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

missing result check

Added a comment before calling sgx_profile_reinit.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r17, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r17, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

Done.

But not like that - pd_open() is the function creating perf_data object and it's always called first, it should be listed before pd_open_file.


pal/src/host/linux-sgx/host_perf_data.c line 301 at r17 (raw file):

        log_error("pd_close_file: close failed: %s", unix_strerror(close_ret));

    pd->fd = -1;

Should be before if, even a failed close() makes the descriptor unusable (see the corresponding manpage).


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

this info must be passed to the user

It doesn't have to, it's just some useful debug information.

We still can have it even if user set loader.log_level = "none".

Yeah, and that's the very problem - it breaks the explicit user choice to stop Gramine from printing logs. All other places in our codebase respect this settings (except places before log_level is initialized).

I'm not sure what's the best solution here, I'd probably just go with log_debug or log_error with added [INFO].


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):
But we cannot continue just ignoring the failure, the whole profiling state is inconsistent?

there's no reason to terminate the whole application

I don't agree, the user wanted to perf-trace the whole app, doing that completely failed and is unusable after this point. I don't see why a user would want to continue execution in this state.


pal/src/host/linux-sgx/host_profile.c line 247 at r17 (raw file):

    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) {
        /* No error checking since we're in exception handler and the function already prints
           errors */

Suggestion:

 * errors */

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @jkr0103 and @mkow)


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

this info must be passed to the user

It doesn't have to, it's just some useful debug information.

We still can have it even if user set loader.log_level = "none".

Yeah, and that's the very problem - it breaks the explicit user choice to stop Gramine from printing logs. All other places in our codebase respect this settings (except places before log_level is initialized).

I'm not sure what's the best solution here, I'd probably just go with log_debug or log_error with added [INFO].

Ok, then I suggest to follow @mkow's advice and do:

    log_error("[INFO] Profile data written to %s (%lu bytes)", g_profile_filename, size);

pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

But we cannot continue just ignoring the failure, the whole profiling state is inconsistent?

there's no reason to terminate the whole application

I don't agree, the user wanted to perf-trace the whole app, doing that completely failed and is unusable after this point. I don't see why a user would want to continue execution in this state.

Hm, actually @mkow is right. The perf becomes unusable. Let's do this forceful termination then:

    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) {
        ret = sgx_profile_reinit();
        if (ret < 0) {
            /* no need to print error message since sgx_profile_init() already prints it */
            BUG();
        }
    }

Copy link
Contributor Author

@jkr0103 jkr0103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 7 files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

But not like that - pd_open() is the function creating perf_data object and it's always called first, it should be listed before pd_open_file.

Usually callee is added before caller, hence made the change accordingly. Anyways I updated based on the comments.


pal/src/host/linux-sgx/host_perf_data.c line 301 at r17 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Should be before if, even a failed close() makes the descriptor unusable (see the corresponding manpage).

Done.


pal/src/host/linux-sgx/host_profile.c line 197 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ok, then I suggest to follow @mkow's advice and do:

    log_error("[INFO] Profile data written to %s (%lu bytes)", g_profile_filename, size);

Done.


pal/src/host/linux-sgx/host_profile.c line 213 at r16 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

ditto

Done


pal/src/host/linux-sgx/host_profile.c line 246 at r16 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Hm, actually @mkow is right. The perf becomes unusable. Let's do this forceful termination then:

    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) {
        ret = sgx_profile_reinit();
        if (ret < 0) {
            /* no need to print error message since sgx_profile_init() already prints it */
            BUG();
        }
    }

Done.


pal/src/host/linux-sgx/host_profile.c line 247 at r17 (raw file):

    if (__atomic_exchange_n(&g_trigger_profile_reinit, false, __ATOMIC_ACQ_REL) == true) {
        /* No error checking since we're in exception handler and the function already prints
           errors */

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r18, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @mkow)

@mkow mkow requested a review from dimakuv June 25, 2024 21:19
mkow
mkow previously approved these changes Jun 25, 2024
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r18, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: ITL), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @jkr0103)


pal/src/host/linux-sgx/host_internal.h line 176 at r16 (raw file):

Previously, jkr0103 (Jitender Kumar) wrote…

Usually callee is added before caller, hence made the change accordingly. Anyways I updated based on the comments.

That's mostly for static functions I think, because C is old and annoying, and you either need a forward declaration or declare all the functions in their dependency order. But in the headers I prefer a more logical grouping, so that it's easier to read by the function's user.

@jkr0103 jkr0103 force-pushed the jkr0103/perf_record_reset branch 2 times, most recently from bda60ac to 4e2a7df Compare June 26, 2024 05:26
This allows to collect profile data only for a particular period (e.g.,
during Gramine startup or during an actual benchmark run).

Signed-off-by: jkr0103 <jitender.kumar@intel.com>
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I will test this PR when other reviewers also approve and before the merge. Otherwise looks good to me.

Tested on Redis and Memcached. Things look good to me.

Gotcha: If there's nothing to process, all Memcached threads go into sleep waiting on epoll_wait(<forever>). This means that a SIGUSR1 signal will set the flag to g_trigger_profile_reinit, but the actual write-to-perf-file will happen only on the next SGX enclave exit, which will happen only after epoll_wait() returns, i.e. on a next send/recv request from the client (or on SIGTERM). I don't consider it a problem, as this is anyway what we want to measure.



-- commits line 2 at r4:

Previously, mkow (Michał Kowalczyk) wrote…

(and also, Pal -> PAL)

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r11, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)

@dimakuv
Copy link
Contributor

dimakuv commented Jun 26, 2024

Jenkins, test this please

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dimakuv dimakuv merged commit 6d77bcf into gramineproject:master Jun 26, 2024
17 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants