Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thin DS: Command Line Interface #11549

Merged
merged 12 commits into from
Feb 21, 2024
Merged

Conversation

Honny1
Copy link
Collaborator

@Honny1 Honny1 commented Feb 6, 2024

Description:

This PR introduces a new flag for generating thin data streams. The --rule-id flag generates a thin data stream for a single rule. The thin data stream for a single rule generated by the --rule-id flag is reduced from 19 MB to approximately 890 KB. Also, the --thin flag is created to generate thin data streams for each rule in the rules in benchmark relevant to the product. The rules use default variables.

Further reduction of CPE, XCCDF, etc. will be done in future PRs. Also, the documentation of the changes will be created in another PR.

A side effect of this PR is automatic XML formatting using Python3+. So in another PR, the XML formatting step can be removed from cmake.

--rule-id RULE_ID

This flag builds a single Datastream as in the case of thick Datastreams but as a thin Datastream with only one rule.

--thin

This flag builds thin Datastreams for each rule relevant to the benchmark. The components of each rule, such as XCCDF, OVAL, and OCIL, are stored in the build directory in ./build/<PRODUCT>/thin_ds_components as files with prefixes for specific component types and the rule id. The final Thin Datastream is stored in the build directory under ./build/thin_ds/. In this case, you disabled the Datastream build version 1. 2.

Review Hints:

To test the --rule-id flag, you can run this command:

./build_product fedora --rule-id audit_rules_privileged_commands_fusermount3

To test the --thin flag, you can run this script:

#!/bin/bash

./build_product fedora --thin

for filename in ./build/thin_ds/*.xml; do
    echo "$filename"
    oscap xccdf eval  --profile "(all)" --results-arf "arf_results/arf_$(basename -- $filename)" "$filename"
done

The script generates a thin Datastream for each rule and then performs a scan using oscap.
This test takes more than an hour to run because there are approximately 1830 rules to process and some rules are very memory intensive.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 6, 2024
Copy link

openshift-ci bot commented Feb 6, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

github-actions bot commented Feb 6, 2024

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod

@Honny1 Honny1 force-pushed the thin-ds-cli branch 3 times, most recently from cd4904d to f2d31c0 Compare February 6, 2024 18:21
Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great thing!

I tried to run the test scenarios for rule audit_rules_privileged_commands and using the full data stream it took 5:12 but using the thin data stream built from this PR it took only 3:18. That's a big saving.

In general, I suggest focusing now on getting this merge-able instead of focusing on further reduction of the built data stream. Further reduction of the built data stream can be done in following separate PRs.

@@ -118,7 +118,7 @@ macro(ssg_build_compiled_artifacts PRODUCT)
add_custom_command(
OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/ssg_build_compile_all-${PRODUCT}"
COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_CURRENT_BINARY_DIR}/profiles"
COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json"
COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" "${SSG_THIN_DS_ARG}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that it's sometimes used here but I don't like this idea of including the option into the variable. Instead I prefer to write the option explicitly, because that would be consistent with the other options here.

Suggested change
COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" "${SSG_THIN_DS_ARG}"
COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" --rule-id "${SSG_THIN_DS_RULE_ID}"

@jan-cerny jan-cerny self-assigned this Feb 7, 2024
@jan-cerny jan-cerny added this to the 0.1.73 milestone Feb 7, 2024
@Honny1 Honny1 force-pushed the thin-ds-cli branch 8 times, most recently from efd1aa4 to e5fdf78 Compare February 12, 2024 16:43
@openshift-merge-robot openshift-merge-robot added the needs-rebase Used by openshift-ci bot. label Feb 13, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Used by openshift-ci bot. label Feb 13, 2024
@Honny1 Honny1 added the Infrastructure Our content build system label Feb 15, 2024
@Honny1 Honny1 marked this pull request as ready for review February 15, 2024 14:07
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 15, 2024
@comps
Copy link
Collaborator

comps commented Feb 15, 2024

Seems to work for me - at least oscap xccdf eval consumes it and reports a result.

The speed up in calling oscap is nice, even if the content build time takes a lot longer than a "fat" datastream (when building all rules for a product).

$ time oscap xccdf eval --profile '(all)' --rule xccdf_org.ssgproject.content_rule_umask_for_daemons /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
real    0m1.426s
user    0m1.282s
sys     0m0.135s

$ time oscap xccdf eval --profile '(all)' --rule xccdf_org.ssgproject.content_rule_umask_for_daemons ssg-rhel9-ds_umask_for_daemons.xml
real    0m0.178s
user    0m0.161s
sys     0m0.013s

(Some random rule I saw in the list.)

@Mab879
Copy link
Member

Mab879 commented Feb 15, 2024

I tried build the firefox product with --thin and the build failed. I got similar results with the chromium product.

  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 150, in <module>
    main()
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 146, in main
    link_benchmark(loader, xccdftree, args)
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 93, in link_benchmark
    link_oval(xccdftree, checks, args.oval, args.build_ovals_dir)
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 75, in link_oval
    oval_linker.link()
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/build_renumber.py", line 171, in link
    self.oval_document = load_oval_document(parse_file(self.fname))
                                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/xml.py", line 84, in parse_file
    tree = open_xml(filename)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/xml.py", line 77, in open_xml
    return ElementTree.parse(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/xml/etree/ElementTree.py", line 1203, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.12/xml/etree/ElementTree.py", line 557, in parse
    source = open(source, "rb")
             ^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.Path

if rule_id not in rules and "ALL_RULES" not in rule_id:
raise Exception("Rule ID: {} not found!".format(rule_id))
if "ALL_RULES" not in rule_id:
rules = set(rule_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. It will create a set of letters instead of a set containing a single item.

jcerny@fedora:~/work/git/scap-security-guide (pr/11549)$ python3
Python 3.12.1 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> set("ahoj")
{'o', 'j', 'a', 'h'}
>>> 

def get_profiles_per_rule_by_id(rule_id, project_root_abspath, env_yaml, product_yaml):
relevant_benchmarks = get_relevant_benchmarks(env_yaml, product_yaml)
rules = find_existing_rules(project_root_abspath, relevant_benchmarks)
if rule_id not in rules and "ALL_RULES" not in rule_id:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This special value needs to be mentioned in the help text of the --rule-id option.

@@ -158,8 +205,14 @@ def main():
controls_manager.remove_selections_not_known(loader.all_rules)
controls_manager.add_references(loader.all_rules)

profiles_by_id = get_all_resolved_profiles_by_id(
env_yaml, product_yaml, loader, product_cpes, controls_manager, controls_dir)
if args.rule_id is None or args.rule_id == "off":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The special value needs to be described in the help text of the --rule-id option.

for rule in rules:
data = {
'documentation_complete': True,
'variables': variables,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'variables': variables,
'variables': variables,
'hidden': True,

I suggest creating a hidden profile so that this fake profile won't appear in the built SCAP data stream.

)
parser.add_argument(
"--per-profile-dir",
help="Directory to store XCCDF, OVAL, OCIL, per profile. (off: to disable)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per profile? I'm confused. It seems like they're per rule. Can you clarify?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole process of creating a thin datastream is that a profile is generated for each rule. (In case only one rule is specified is created one profile) This is done in the compile_all.py file. When the --thin flag is set, XCCDF, OVAL and OCIL components need to be generated for each rule. Since each rule has its own profile the generation is done by profiles. (In case the one rule is specified --per-profile-dir flag is disabled). Also, this can be used sometime in the future when we want to build a datastream with only one profile.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so in fact the build system creates fake profiles for each rules and then build_xccdf.py simply creates content driven by each of the fake profiles.

You're right that naming this a "--per-profile-dir" is technically correct because that's the way how things work. However, I'm afraid that it might create more confusion given the way how it's used. If someone takes a look at what is generated into this directory they might not realize that it's internally implemented using the fake profiles.

I suggest renaming this to --thin-ds-components-dir or similar. This name would better describe the actual usage of this option. Moreover, it will still be correct if you consider that a data stream with just a single profile is also a thin data stream.

I suggest renaming also the SSG_PER_PROFILE_COMPONENT_DIR CMake variable in a similar way.

@@ -93,6 +94,8 @@ print_help()
printf '\t%s\n' "--derivatives, --no-derivatives: Also build derivatives of products if applicable (off by default)"
printf '\t%s\n' "--ansible-playbooks, --no-ansible-playbooks: Build Ansible Playbooks for every profile (on by default)"
printf '\t%s\n' "--bash-scripts, --no-bash-scripts: Build Bash remediation scripts for every profile (on by default)"
printf '\t%s\n' "-t, --thin, --no-thin: Build thin data streams for each rule. Do not build any of the guides, tables, etc (off by default)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This takes a horrible amount of time. Do you have any idea why it could take so long?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it generates about 1800 thin ds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try some products with smaller benchmark. (For example firefox)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it generates about 1800 thin ds.

Obviously.

But I was more interested if you know in which part it spends the most time and what slows this down the most. Is it composing the data streams from individual components? Processing compiled rules? Something else?

In theory it should be possible to make the difference between building a single large data stream or many small data streams negligible because these outputs differ only structuring of the output data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating the XCCDF benchmark on a profile and linking it with other components causes the biggest slowdown. I'm working on speeding it up right now. Should I incorporate the changes into this PR or create another one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose doing that in a separate PR. This PR is already large enough.

@@ -751,6 +751,10 @@ macro(ssg_build_product PRODUCT)
set(PRODUCT_${_LANGUAGE}_REMEDIATION_ENABLED TRUE)
endforeach()

set(SSG_PER_PROFILE_COMPONET_DIR "off")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COMPONET

p = Paths_(
xccdf=get_path(args.per_profile_dir, "xccdf_{}.xml".format(id_)),
oval=get_path(args.per_profile_dir, "oval_{}.xml".format(id_)),
ocil=get_path(args.per_profile_dir, "ocil_{}.xml".format(id_)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not include OCIL at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the rule does not contain OCIL. OCIL does not appear in the final ds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'd keep it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the rule contains OCIL, it will appear in the final DS 😄

My point is that the OCIL is useless for the use cases of the thin data stream, so I wouldn't include it at all.

(Side note: our OCIL is useless for all purposes but we need to keep it in the normal DS because of compatibility)

Would it be easy or difficult to not add OCILs to thin DSs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's simple it's just a few lines of code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Honny1
Copy link
Collaborator Author

Honny1 commented Feb 16, 2024

@Mab879 The problem you found should be fixed.

@Mab879
Copy link
Member

Mab879 commented Feb 16, 2024

@Mab879 The problem you found should be fixed.

Confirmed, both the firefox and chromium products build now with -t.

Thanks!

Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outputs look good. We can solve further content size reduction in separate PRs.

Please see my comments.

CMakeLists.txt Outdated
@@ -275,6 +279,7 @@ message(STATUS "Separate SCAP files: ${SSG_SEPARATE_SCAP_FILES_ENABLED}")
message(STATUS "Ansible Playbooks: ${SSG_ANSIBLE_PLAYBOOKS_ENABLED}")
message(STATUS "Ansible Playbooks Per Rule: ${SSG_ANSIBLE_PLAYBOOKS_PER_RULE_ENABLED}")
message(STATUS "Bash scripts: ${SSG_BASH_SCRIPTS_ENABLED}")
message(STATUS "Thin Datastreams: ${SSG_THIN_DS}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
message(STATUS "Thin Datastreams: ${SSG_THIN_DS}")
message(STATUS "Thin data streams: ${SSG_THIN_DS}")

"--rule-id",
type=str,
help="Creates a profile with the specified rule and does not use other profiles."
"It is named in the RULE_ID format. If you want to process "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest something like: The profile ID is identical to the rule ID of the selected rule.

if self.benchmark is None:
raise Exception(
"Before generating benchmarks for each profile, you need to load "
"the initial benchmark. Using the load_benchmark method."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"the initial benchmark. Using the load_benchmark method."
"the initial benchmark using the load_benchmark method."

Copy link

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:11549

Click here to see how to deploy it

If you alread have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:11549

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:11549 make deploy-local

Copy link

codeclimate bot commented Feb 20, 2024

Code Climate has analyzed commit 6b4b49f and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 13.7% (50% is the threshold).

This pull request will bring the total coverage in the repository to 58.1% (-0.1% change).

View more on Code Climate.

@jan-cerny
Copy link
Collaborator

/packit retest-failed

1 similar comment
@jan-cerny
Copy link
Collaborator

/packit retest-failed

Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have generated thin DSs for some of my favourite rules. Then I tried to use them for a scan. It worked as expected.

Also, I have tried to generate thin DSs for all rules, this also worked.

At this moment the PR is good enough so I will merge it. We will continue in separate PRs to avoid bloating this PR and complicating reviews.

I suggest focusing on the following aspects in future PRs:

  • Improve awful performance of the -t option. If the performance will be improved, the feature will suddenly start to be useful for testing in downstream and CI.
  • Removing excess XCCDF Groups and Profiles that aren't essential
  • Integration test

Great job!

@jan-cerny
Copy link
Collaborator

/packit retest-failed

1 similar comment
@jan-cerny
Copy link
Collaborator

/packit retest-failed

@jan-cerny jan-cerny merged commit 8618a1d into ComplianceAsCode:master Feb 21, 2024
42 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Infrastructure Our content build system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants