FEAT(capa2sarif) Add SARIF conversion script from json output #2093

Merged
merged 11 commits into from
Jun 11, 2024

Conversation

ReversingWithMe
Contributor

SARIF gets you navigation for binary beacons from capa in any tool that supports SARIF (e.g., Ghidra, radare2, IDA). I expect this to be a core format for binary analysis in the future.

[Figure: SARIF-SASP-Introduction-figures]

The Static Analysis Results Interchange Format (SARIF) is a standardized format for the output of static analysis tools, which are used to evaluate source code or binaries for things like vulnerabilities or dataflow. SARIF enables different analysis tools to produce results in a common format that can be easily understood, integrated, and acted upon by software development tools and systems. For example, VS Code, Ghidra, radare2, and GitHub all adopt it as a common standard for representing this kind of information.

SARIF describes two things: the analysis that was run and the results of that analysis on an artifact. Results include a description of the artifacts related to a run of the tool, where an artifact can be source code, a binary file, or an auxiliary data file. Results also include the invocation, i.e. how the tool was run, including version, command line, and any knobs/parameters. The idea is that you can reconstruct where the output data came from, for things that depend on the parameters used on a specific input. The results themselves are captured via "rules", where each rule is some type of analysis. One could imagine a single rule identifier for all of capa, but that wouldn't be very useful. For each rule/type of information, there is a single message for the finding as well as a property bag which you can shove anything into.
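To make that layout concrete, here is a minimal hand-built sketch of a SARIF 2.1.0 document in Python. It is not what `scripts/capa2sarif.py` emits; `build_sarif`, the finding fields, the sample rule name, and the schema URL are all assumptions chosen for illustration.

```python
import json

# Assumption: a commonly used public location for the SARIF 2.1.0 JSON schema.
SARIF_SCHEMA = "https://json.schemastore.org/sarif-2.1.0.json"


def build_sarif(tool_name, tool_version, command_line, artifact_uri, findings):
    """Build a one-run SARIF document from a list of finding dicts.

    Each finding is a hypothetical dict with the keys
    rule_id, description, message, address, and optionally properties.
    """
    rules = {}    # one rule entry per distinct rule id
    results = []  # one result entry per finding
    for f in findings:
        rules.setdefault(
            f["rule_id"],
            {"id": f["rule_id"], "shortDescription": {"text": f["description"]}},
        )
        results.append(
            {
                "ruleId": f["rule_id"],
                "message": {"text": f["message"]},  # the single message per finding
                "locations": [
                    {
                        "physicalLocation": {
                            "artifactLocation": {"uri": artifact_uri},
                            "address": {"absoluteAddress": f["address"]},
                        }
                    }
                ],
                # the property bag: free-form, tool-specific data
                "properties": f.get("properties", {}),
            }
        )
    return {
        "$schema": SARIF_SCHEMA,
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": tool_name,
                        "version": tool_version,
                        "rules": list(rules.values()),
                    }
                },
                # how the tool was run
                "invocations": [
                    {"commandLine": command_line, "executionSuccessful": True}
                ],
                # what the tool was run on
                "artifacts": [{"location": {"uri": artifact_uri}}],
                "results": results,
            }
        ],
    }


# Illustrative data only; the rule name and addresses are made up.
doc = build_sarif(
    tool_name="capa",
    tool_version="7.0.0",
    command_line="capa --json sample.exe_",
    artifact_uri="sample.exe_",
    findings=[
        {"rule_id": "example-rule", "description": "example capability",
         "message": "matched at function A", "address": 0x401000},
        {"rule_id": "example-rule", "description": "example capability",
         "message": "matched at function B", "address": 0x402000,
         "properties": {"namespace": "example"}},
    ],
)
print(json.dumps(doc, indent=2))
```

Using one rule id per capability, rather than one for the whole tool, is what lets a SARIF viewer group and filter the findings.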

This PR adds a new script that takes a capa JSON output file (~7.0) and converts it to SARIF (JSON with an additional schema). This is a clean start from a previous PR, to clean up the branch history left over from embedding this as argument flags in capa directly. If this feature gets enough usage and is stable enough, adding a dedicated renderer would be desirable, but that may be better done natively instead of through 3rd party deps.

This includes additional features for current Radare-specific and Ghidra-specific requirements. I expect both of these to get fixed over time.

Steps to test functionality

  1. python3 -m venv venv
  2. source venv/bin/activate
  3. python3 -m pip install -e .[dev]
  4. git submodule init
  5. git submodule update
  6. capa --json tests/data/5d7c34b6854d48d3da4f96b71550a221.exe_ > capa_result.json
  7. python3 -m json.tool capa_result.json // test json compliance
  8. python3 scripts/capa2sarif.py capa_result.json -r > capa_radare.sarif
  9. python3 -m json.tool capa_radare.sarif // test json compliance
  10. r2 tests/data/5d7c34b6854d48d3da4f96b71550a221.exe_
  11. > sarif -i capa_radare.sarif
  12. > sarif -l

In Ghidra the steps are similar, but pass -g instead of -r. Enable the SARIF extension from Install Extensions, then Sarif > Read File > capa_ghidra.sarif.
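Beyond the `json.tool` well-formedness checks in the steps above, a quick look at what the converted file contains can help before loading it into a tool. The following is a small sketch, not part of this PR; `count_results_by_rule` is a hypothetical helper that relies only on the standard SARIF `runs` / `results` / `ruleId` layout.

```python
import json
from collections import Counter


def count_results_by_rule(sarif_text: str) -> Counter:
    """Count SARIF results per ruleId across all runs in the document."""
    doc = json.loads(sarif_text)
    counts = Counter()
    for run in doc.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts


# Tiny inline document so the sketch runs on its own; rule names are made up.
example = json.dumps(
    {
        "version": "2.1.0",
        "runs": [
            {
                "results": [
                    {"ruleId": "rule-a", "message": {"text": "first"}},
                    {"ruleId": "rule-a", "message": {"text": "second"}},
                    {"ruleId": "rule-b", "message": {"text": "third"}},
                ]
            }
        ],
    }
)
for rule_id, n in count_results_by_rule(example).most_common():
    print(f"{n:4d}  {rule_id}")
```

To run it against the file from the steps above, read `capa_radare.sarif` as text and pass the contents to the function.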

Interactive table spawns
[image]

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

ReversingWithMe and others added 2 commits May 27, 2024 08:52
@ReversingWithMe
Contributor Author

Clean up from this PR #2036

@williballenthin williballenthin self-requested a review June 6, 2024 08:43
Collaborator

@williballenthin williballenthin left a comment

This is looking really good!

Thanks for taking the time to introduce us to SARIF and provide the script. The logic looks reasonable, and aside from some nits that I noticed, I don't see any reason not to merge this soon.

One idea: rather than interacting with the capa JSON, you might want to deserialize it into the ResultDocument format that capa provides. This has full type hints that mypy checks, whereas the JSON document doesn't have any codified schema. Therefore, if we ever change the JSON document, we'd only notice bugs when this script breaks. By using the type checked ResultDocument, we can catch that with static analysis tools. That being said, I recognize this would take you a bit more work, so I understand if you can't make the changes now. We can do it at the first bug ;-)
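As a rough illustration of the difference, the sketch below deserializes JSON into a small typed model using only the standard library. It is a stand-in, not capa's ResultDocument: `MiniResultDocument`, `MiniMeta`, and the field names read from the JSON are assumptions made up for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniMeta:
    version: str
    sample_path: str


@dataclass(frozen=True)
class MiniResultDocument:
    """Hypothetical, heavily reduced stand-in for a typed result document."""

    meta: MiniMeta
    rule_names: tuple

    @classmethod
    def from_json(cls, text: str) -> "MiniResultDocument":
        raw = json.loads(text)
        # All knowledge of the JSON layout lives here. If the layout changes,
        # loading fails immediately instead of deep inside the converter.
        meta = MiniMeta(
            version=raw["meta"]["version"],
            sample_path=raw["meta"]["sample"]["path"],
        )
        return cls(meta=meta, rule_names=tuple(sorted(raw["rules"])))


# Illustrative input; the values are invented.
text = json.dumps(
    {
        "meta": {"version": "7.0.0", "sample": {"path": "sample.exe_"}},
        "rules": {"rule-b": {}, "rule-a": {}},
    }
)
doc = MiniResultDocument.from_json(text)

# Downstream code uses attributes, so a type checker such as mypy can flag a
# misspelled or removed field (e.g. doc.meta.verson) without running anything.
print(doc.meta.version, doc.meta.sample_path, doc.rule_names)
```

With raw dictionaries, the same mistake would only show up at runtime as a `KeyError` on whichever code path touches the field.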

Review threads (all resolved): pyproject.toml (one, outdated); scripts/capa2sarif.py (five, four of them outdated).
@williballenthin
Collaborator

recommend also adding a trivial test to test_scripts.py to show that this script can generate output without hitting exceptions, which we can then verify in CI.

@ReversingWithMe
Contributor Author

ReversingWithMe commented Jun 7, 2024

These are reasonable; I will work on adding them today, thanks!

@ReversingWithMe
Contributor Author

ReversingWithMe commented Jun 7, 2024

This should address the above suggestions, thanks again! I am not sure about the test, but using an existing result document seems to be a good test case (granted, this would NOT catch breaking changes if the JSON changes over time).

I am not sure of a good way to take an input file, run capa, and then run the script afterwards; I think the current approach is wrong, though.

@williballenthin
Collaborator

i think you could use any of the json files in capa-testfiles/rd/ as the input. We'll update those if the format ever changes. No need to invoke capa in the test to generate the json.
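A smoke test along those lines could look roughly like the sketch below. It is not the test that was merged; `run_script_expect_json` is a hypothetical helper, and the script and input paths in the trailing comment are placeholders to be filled in from the repository.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_script_expect_json(script: Path, args: list[str]) -> dict:
    """Run a Python script in a subprocess and parse its stdout as JSON.

    Fails if the script exits non-zero or prints something that is not JSON,
    which is all a "does not hit exceptions" smoke test needs.
    """
    proc = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
        check=False,
    )
    assert proc.returncode == 0, proc.stderr
    return json.loads(proc.stdout)


# Hypothetical pytest usage (both paths are placeholders, not real file names):
#
# def test_capa2sarif_smoke():
#     script = Path("scripts") / "capa2sarif.py"
#     rd_json = ...  # any JSON result document from the rd/ directory of capa-testfiles
#     sarif = run_script_expect_json(script, [str(rd_json)])
#     assert "runs" in sarif
```

Because the input is a stored result document, the test never has to invoke capa itself.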

Collaborator

@williballenthin williballenthin left a comment

awesome!

Review thread (resolved): tests/test_scripts.py
@williballenthin
Collaborator

please resolve merge conflicts and then i'll merge!

@williballenthin williballenthin merged commit 52e24e5 into mandiant:master Jun 11, 2024
8 of 9 checks passed
@williballenthin
Collaborator

thank you @ReversingWithMe!

ygasparis pushed a commit to ygasparis/capa that referenced this pull request Jun 18, 2024
…nt#2093)

* feat(capa2sarif): add new sarif conversion script converting json output to sarif schema, update dependencies, and update changelog

* fix(capa2sarif): removing copy and paste transcription errors

* fix(capa2sarif): remove dependencies from pyproject toml to guarded import statements

* chore(capa2sarif): adding node in readme specifying dependency and applied auto formatter for styling

* style(capa2sarif): applied import sorting and fixed typo in invocations function

* test(capa2sarif): adding simple test for capa to sarif conversion script using existing result document

* style(capa2sarif): fixing typo in version string in usage

* style(capa2sarif): isort failing due to reordering of typehint imports

* style(capa2sarif): fixing import order as isort on local machine was not updating code

---------

Co-authored-by: ReversingWithMe <ryanv@rewith.me>
Co-authored-by: Willi Ballenthin <wballenthin@google.com>

2 participants