Thin DS: Command Line Interface #11549

Honny1 · 2024-02-06T16:26:48Z

Description:

This PR introduces a new flag for generating thin data streams. The --rule-id flag generates a thin data stream for a single rule. The thin data stream for a single rule generated by the --rule-id flag is reduced from 19 MB to approximately 890 KB. Also, the --thin flag is created to generate thin data streams for each rule in the rules in benchmark relevant to the product. The rules use default variables.

Further reduction of CPE, XCCDF, etc. will be done in future PRs. Also, the documentation of the changes will be created in another PR.

A side effect of this PR is automatic XML formatting using Python3+. So in another PR, the XML formatting step can be removed from cmake.

`--rule-id` RULE_ID

This flag builds a single Datastream as in the case of thick Datastreams but as a thin Datastream with only one rule.

`--thin`

This flag builds thin Datastreams for each rule relevant to the benchmark. The components of each rule, such as XCCDF, OVAL, and OCIL, are stored in the build directory in ./build/<PRODUCT>/thin_ds_components as files with prefixes for specific component types and the rule id. The final Thin Datastream is stored in the build directory under ./build/thin_ds/. In this case, you disabled the Datastream build version 1. 2.

Review Hints:

To test the --rule-id flag, you can run this command:

./build_product fedora --rule-id audit_rules_privileged_commands_fusermount3

To test the --thin flag, you can run this script:

#!/bin/bash

./build_product fedora --thin

for filename in ./build/thin_ds/*.xml; do
    echo "$filename"
    oscap xccdf eval  --profile "(all)" --results-arf "arf_results/arf_$(basename -- $filename)" "$filename"
done

The script generates a thin Datastream for each rule and then performs a scan using oscap.
This test takes more than an hour to run because there are approximately 1830 rules to process and some rules are very memory intensive.

openshift-ci · 2024-02-06T16:26:59Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

github-actions · 2024-02-06T16:28:03Z

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment

Oracle Linux 8 Environment

jan-cerny

This is a great thing!

I tried to run the test scenarios for rule audit_rules_privileged_commands and using the full data stream it took 5:12 but using the thin data stream built from this PR it took only 3:18. That's a big saving.

In general, I suggest focusing now on getting this merge-able instead of focusing on further reduction of the built data stream. Further reduction of the built data stream can be done in following separate PRs.

jan-cerny · 2024-02-07T10:17:57Z

cmake/SSGCommon.cmake

@@ -118,7 +118,7 @@ macro(ssg_build_compiled_artifacts PRODUCT)
        add_custom_command(
            OUTPUT "${CMAKE_CURRENT_BINARY_DIR}/ssg_build_compile_all-${PRODUCT}"
            COMMAND ${CMAKE_COMMAND} -E make_directory "${CMAKE_CURRENT_BINARY_DIR}/profiles"
-            COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json"
+            COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" "${SSG_THIN_DS_ARG}"


I know that it's sometimes used here but I don't like this idea of including the option into the variable. Instead I prefer to write the option explicitly, because that would be consistent with the other options here.

Suggested change

COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" "${SSG_THIN_DS_ARG}"

COMMAND env "PYTHONPATH=$ENV{PYTHONPATH}" "${PYTHON_EXECUTABLE}" "${SSG_BUILD_SCRIPTS}/compile_all.py" --resolved-base "${CMAKE_CURRENT_BINARY_DIR}" --project-root "${CMAKE_SOURCE_DIR}" --build-config-yaml "${CMAKE_BINARY_DIR}/build_config.yml" --product-yaml "${CMAKE_CURRENT_BINARY_DIR}/product.yml" --sce-metadata "${CMAKE_CURRENT_BINARY_DIR}/checks/sce/metadata.json" --rule-id "${SSG_THIN_DS_RULE_ID}"

comps · 2024-02-15T15:04:17Z

Seems to work for me - at least oscap xccdf eval consumes it and reports a result.

The speed up in calling oscap is nice, even if the content build time takes a lot longer than a "fat" datastream (when building all rules for a product).

$ time oscap xccdf eval --profile '(all)' --rule xccdf_org.ssgproject.content_rule_umask_for_daemons /usr/share/xml/scap/ssg/content/ssg-rhel9-ds.xml
real    0m1.426s
user    0m1.282s
sys     0m0.135s

$ time oscap xccdf eval --profile '(all)' --rule xccdf_org.ssgproject.content_rule_umask_for_daemons ssg-rhel9-ds_umask_for_daemons.xml
real    0m0.178s
user    0m0.161s
sys     0m0.013s

(Some random rule I saw in the list.)

Mab879 · 2024-02-15T22:46:59Z

I tried build the firefox product with --thin and the build failed. I got similar results with the chromium product.

  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 150, in <module>
    main()
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 146, in main
    link_benchmark(loader, xccdftree, args)
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 93, in link_benchmark
    link_oval(xccdftree, checks, args.oval, args.build_ovals_dir)
  File "/home/mburket/Developer/ComplianceAsCode/content/build-scripts/build_xccdf.py", line 75, in link_oval
    oval_linker.link()
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/build_renumber.py", line 171, in link
    self.oval_document = load_oval_document(parse_file(self.fname))
                                            ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/xml.py", line 84, in parse_file
    tree = open_xml(filename)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mburket/Developer/ComplianceAsCode/content/ssg/xml.py", line 77, in open_xml
    return ElementTree.parse(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/xml/etree/ElementTree.py", line 1203, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python3.12/xml/etree/ElementTree.py", line 557, in parse
    source = open(source, "rb")
             ^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.Path

jan-cerny · 2024-02-16T08:51:47Z

build-scripts/compile_all.py

+    if rule_id not in rules and "ALL_RULES" not in rule_id:
+        raise Exception("Rule ID: {} not found!".format(rule_id))
+    if "ALL_RULES" not in rule_id:
+        rules = set(rule_id)


This is wrong. It will create a set of letters instead of a set containing a single item.

jcerny@fedora:~/work/git/scap-security-guide (pr/11549)$ python3 Python 3.12.1 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux Type "help", "copyright", "credits" or "license" for more information. >>> set("ahoj") {'o', 'j', 'a', 'h'} >>>

jan-cerny · 2024-02-16T08:54:45Z

build-scripts/compile_all.py

+def get_profiles_per_rule_by_id(rule_id, project_root_abspath, env_yaml, product_yaml):
+    relevant_benchmarks = get_relevant_benchmarks(env_yaml, product_yaml)
+    rules = find_existing_rules(project_root_abspath, relevant_benchmarks)
+    if rule_id not in rules and "ALL_RULES" not in rule_id:


This special value needs to be mentioned in the help text of the --rule-id option.

jan-cerny · 2024-02-16T08:56:02Z

build-scripts/compile_all.py

@@ -158,8 +205,14 @@ def main():
    controls_manager.remove_selections_not_known(loader.all_rules)
    controls_manager.add_references(loader.all_rules)

-    profiles_by_id = get_all_resolved_profiles_by_id(
-        env_yaml, product_yaml, loader, product_cpes, controls_manager, controls_dir)
+    if args.rule_id is None or args.rule_id == "off":


The special value needs to be described in the help text of the --rule-id option.

jan-cerny · 2024-02-16T09:02:22Z

build-scripts/compile_all.py

+    for rule in rules:
+        data = {
+              'documentation_complete': True,
+              'variables': variables,


Suggested change

'variables': variables,

'variables': variables,

'hidden': True,

I suggest creating a hidden profile so that this fake profile won't appear in the built SCAP data stream.

jan-cerny · 2024-02-16T09:24:18Z

build-scripts/build_xccdf.py

+    )
+    parser.add_argument(
+        "--per-profile-dir",
+        help="Directory to store XCCDF, OVAL, OCIL, per profile. (off: to disable)"


Per profile? I'm confused. It seems like they're per rule. Can you clarify?

The whole process of creating a thin datastream is that a profile is generated for each rule. (In case only one rule is specified is created one profile) This is done in the compile_all.py file. When the --thin flag is set, XCCDF, OVAL and OCIL components need to be generated for each rule. Since each rule has its own profile the generation is done by profiles. (In case the one rule is specified --per-profile-dir flag is disabled). Also, this can be used sometime in the future when we want to build a datastream with only one profile.

OK, so in fact the build system creates fake profiles for each rules and then build_xccdf.py simply creates content driven by each of the fake profiles.

You're right that naming this a "--per-profile-dir" is technically correct because that's the way how things work. However, I'm afraid that it might create more confusion given the way how it's used. If someone takes a look at what is generated into this directory they might not realize that it's internally implemented using the fake profiles.

I suggest renaming this to --thin-ds-components-dir or similar. This name would better describe the actual usage of this option. Moreover, it will still be correct if you consider that a data stream with just a single profile is also a thin data stream.

I suggest renaming also the SSG_PER_PROFILE_COMPONENT_DIR CMake variable in a similar way.

jan-cerny · 2024-02-16T09:33:57Z

build_product

@@ -93,6 +94,8 @@ print_help()
 	printf '\t%s\n' "--derivatives, --no-derivatives: Also build derivatives of products if applicable (off by default)"
 	printf '\t%s\n' "--ansible-playbooks, --no-ansible-playbooks: Build Ansible Playbooks for every profile (on by default)"
 	printf '\t%s\n' "--bash-scripts, --no-bash-scripts: Build Bash remediation scripts for every profile (on by default)"
+	printf '\t%s\n' "-t, --thin, --no-thin: Build thin data streams for each rule. Do not build any of the guides, tables, etc (off by default)"


This takes a horrible amount of time. Do you have any idea why it could take so long?

Because it generates about 1800 thin ds.

Try some products with smaller benchmark. (For example firefox)

Because it generates about 1800 thin ds.

Obviously.

But I was more interested if you know in which part it spends the most time and what slows this down the most. Is it composing the data streams from individual components? Processing compiled rules? Something else?

In theory it should be possible to make the difference between building a single large data stream or many small data streams negligible because these outputs differ only structuring of the output data.

Generating the XCCDF benchmark on a profile and linking it with other components causes the biggest slowdown. I'm working on speeding it up right now. Should I incorporate the changes into this PR or create another one?

I propose doing that in a separate PR. This PR is already large enough.

jan-cerny · 2024-02-16T09:39:52Z

cmake/SSGCommon.cmake

@@ -751,6 +751,10 @@ macro(ssg_build_product PRODUCT)
        set(PRODUCT_${_LANGUAGE}_REMEDIATION_ENABLED TRUE)
    endforeach()

+    set(SSG_PER_PROFILE_COMPONET_DIR "off")


jan-cerny · 2024-02-16T09:40:39Z

build-scripts/build_xccdf.py

+        p = Paths_(
+            xccdf=get_path(args.per_profile_dir, "xccdf_{}.xml".format(id_)),
+            oval=get_path(args.per_profile_dir, "oval_{}.xml".format(id_)),
+            ocil=get_path(args.per_profile_dir, "ocil_{}.xml".format(id_)),


Can we not include OCIL at all?

If the rule does not contain OCIL. OCIL does not appear in the final ds.

So I'd keep it.

But if the rule contains OCIL, it will appear in the final DS 😄

My point is that the OCIL is useless for the use cases of the thin data stream, so I wouldn't include it at all.

(Side note: our OCIL is useless for all purposes but we need to keep it in the normal DS because of compatibility)

Would it be easy or difficult to not add OCILs to thin DSs?

Yes, it's simple it's just a few lines of code.

Disable build of OCIL for thin ds

Honny1 · 2024-02-16T13:13:32Z

@Mab879 The problem you found should be fixed.

Mab879 · 2024-02-16T13:33:27Z

@Mab879 The problem you found should be fixed.

Confirmed, both the firefox and chromium products build now with -t.

Thanks!

jan-cerny

The outputs look good. We can solve further content size reduction in separate PRs.

Please see my comments.

jan-cerny · 2024-02-19T15:29:14Z

CMakeLists.txt

@@ -275,6 +279,7 @@ message(STATUS "Separate SCAP files: ${SSG_SEPARATE_SCAP_FILES_ENABLED}")
 message(STATUS "Ansible Playbooks: ${SSG_ANSIBLE_PLAYBOOKS_ENABLED}")
 message(STATUS "Ansible Playbooks Per Rule: ${SSG_ANSIBLE_PLAYBOOKS_PER_RULE_ENABLED}")
 message(STATUS "Bash scripts: ${SSG_BASH_SCRIPTS_ENABLED}")
+message(STATUS "Thin Datastreams: ${SSG_THIN_DS}")


Suggested change

message(STATUS "Thin Datastreams: ${SSG_THIN_DS}")

message(STATUS "Thin data streams: ${SSG_THIN_DS}")

jan-cerny · 2024-02-19T15:30:23Z

build-scripts/compile_all.py

+        "--rule-id",
+        type=str,
+        help="Creates a profile with the specified rule and does not use other profiles."
+        "It is named in the RULE_ID format. If you want to process "


I suggest something like: The profile ID is identical to the rule ID of the selected rule.

jan-cerny · 2024-02-19T15:31:42Z

ssg/build_yaml.py

+        if self.benchmark is None:
+            raise Exception(
+                "Before generating benchmarks for each profile, you need to load "
+                "the initial benchmark. Using the load_benchmark method."


Suggested change

"the initial benchmark. Using the load_benchmark method."

"the initial benchmark using the load_benchmark method."

github-actions · 2024-02-20T10:09:17Z

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:11549

Click here to see how to deploy it

If you alread have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:11549

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:11549 make deploy-local

codeclimate · 2024-02-20T12:46:39Z

Code Climate has analyzed commit 6b4b49f and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 13.7% (50% is the threshold).

This pull request will bring the total coverage in the repository to 58.1% (-0.1% change).

View more on Code Climate.

jan-cerny · 2024-02-21T08:25:01Z

/packit retest-failed

jan-cerny · 2024-02-21T08:57:35Z

/packit retest-failed

jan-cerny

I have generated thin DSs for some of my favourite rules. Then I tried to use them for a scan. It worked as expected.

Also, I have tried to generate thin DSs for all rules, this also worked.

At this moment the PR is good enough so I will merge it. We will continue in separate PRs to avoid bloating this PR and complicating reviews.

I suggest focusing on the following aspects in future PRs:

Improve awful performance of the -t option. If the performance will be improved, the feature will suddenly start to be useful for testing in downstream and CI.
Removing excess XCCDF Groups and Profiles that aren't essential
Integration test

Great job!

jan-cerny · 2024-02-21T10:47:44Z

/packit retest-failed

jan-cerny · 2024-02-21T11:36:41Z

/packit retest-failed

openshift-ci bot added the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 6, 2024

Honny1 force-pushed the thin-ds-cli branch 3 times, most recently from cd4904d to f2d31c0 Compare February 6, 2024 18:21

jan-cerny reviewed Feb 7, 2024

View reviewed changes

jan-cerny self-assigned this Feb 7, 2024

jan-cerny added this to the 0.1.73 milestone Feb 7, 2024

Honny1 force-pushed the thin-ds-cli branch 8 times, most recently from efd1aa4 to e5fdf78 Compare February 12, 2024 16:43

openshift-merge-robot added the needs-rebase Used by openshift-ci bot. label Feb 13, 2024

Honny1 force-pushed the thin-ds-cli branch from e5fdf78 to 06bc135 Compare February 13, 2024 14:40

openshift-merge-robot removed the needs-rebase Used by openshift-ci bot. label Feb 13, 2024

Honny1 force-pushed the thin-ds-cli branch from 421aea7 to ecf2532 Compare February 14, 2024 14:06

Honny1 added the Infrastructure Our content build system label Feb 15, 2024

Honny1 marked this pull request as ready for review February 15, 2024 14:07

openshift-ci bot removed the do-not-merge/work-in-progress Used by openshift-ci bot. label Feb 15, 2024

Honny1 requested a review from jan-cerny February 15, 2024 15:26

Honny1 force-pushed the thin-ds-cli branch from 0d0560d to de414c2 Compare February 15, 2024 16:14

jan-cerny reviewed Feb 16, 2024

View reviewed changes

Honny1 force-pushed the thin-ds-cli branch from de414c2 to 8de2167 Compare February 16, 2024 10:55

Honny1 force-pushed the thin-ds-cli branch from 8de2167 to 86311ec Compare February 16, 2024 12:26

Honny1 requested a review from jan-cerny February 16, 2024 13:11

Honny1 force-pushed the thin-ds-cli branch from c6d3411 to e483d26 Compare February 19, 2024 10:21

jan-cerny reviewed Feb 19, 2024

View reviewed changes

Honny1 added 4 commits February 20, 2024 10:54

Rework build_xccdf.py

e7bf5bf

Add entrypoints to the build_product script

ff7f5c2

Create profile per rule

964a76d

Format long lines

59ac346

Honny1 force-pushed the thin-ds-cli branch from e483d26 to 4fe0f5f Compare February 20, 2024 10:04

Honny1 added 8 commits February 20, 2024 13:31

Prepare for generation benchmark per profile

61a4e01

Generates components per profile

da07735

Enable component generation per profile

e4e6da8

Add compose of multiple ds

20ce668

Deal with empty OVAL and OCIL components

b233ac9

Add XML declaration

146b9ee

Fix getting relevant benchmarks for a product

66eedd1

Disable build of OCIL for thin DS

6b4b49f

Honny1 force-pushed the thin-ds-cli branch from 241cac4 to 6b4b49f Compare February 20, 2024 12:32

jan-cerny approved these changes Feb 21, 2024

View reviewed changes

jan-cerny merged commit 8618a1d into ComplianceAsCode:master Feb 21, 2024
42 of 44 checks passed

	'variables': variables,
	'variables': variables,
	'hidden': True,

	message(STATUS "Thin Datastreams: ${SSG_THIN_DS}")
	message(STATUS "Thin data streams: ${SSG_THIN_DS}")

	"the initial benchmark. Using the load_benchmark method."
	"the initial benchmark using the load_benchmark method."

Thin DS: Command Line Interface #11549

Thin DS: Command Line Interface #11549

Conversation

Honny1 commented Feb 6, 2024 • edited Loading

Description:

--rule-id RULE_ID

--thin

Review Hints:

openshift-ci bot commented Feb 6, 2024

github-actions bot commented Feb 6, 2024

jan-cerny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

comps commented Feb 15, 2024

Mab879 commented Feb 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Honny1 commented Feb 16, 2024

Mab879 commented Feb 16, 2024 • edited Loading

jan-cerny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Feb 20, 2024

codeclimate bot commented Feb 20, 2024

jan-cerny commented Feb 21, 2024

jan-cerny commented Feb 21, 2024

jan-cerny left a comment

Choose a reason for hiding this comment

jan-cerny commented Feb 21, 2024

jan-cerny commented Feb 21, 2024

Honny1 commented Feb 6, 2024 •

edited

Loading

`--rule-id` RULE_ID

`--thin`

Mab879 commented Feb 16, 2024 •

edited

Loading