Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compare_query_columns to compare multiple columns across two queries #100

Open
mbyrne00 opened this issue May 24, 2024 · 0 comments
Open
Labels
enhancement New feature or request triage

Comments

@mbyrne00
Copy link

mbyrne00 commented May 24, 2024

Describe the feature

I would love the output from compare_all_columns to be applicable to multiple columns across two queries. Its like compare_column_values, but supporting multiple columns and thus using the output format from compare_all_columns.

Happy to consider alternative naming for macro and args, but below should give clear indication on what I would like.

Describe alternatives you've considered

Alternative is to use all of these individually.

Additional context

I see that compare_all_columns depends on using column definitions defined in the relations, so that means we should just pass them along.

Example summary

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_query_columns(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    columns = ["status", "amount", "some_other_column"], 
) }}

Summary output

column_name perfect_match null_in_a null_in_b missing_from_a missing_from_b conflicting_values
status 10 0 0 0 0 0
amount 2 0 0 0 0 8
some_other_column 6 4 4 0 0 0

Example verbose

{% set old_query %}
    select * from old_database.old_schema.dim_product
    where is_latest
{% endset %}

{% set new_query %}
    select * from {{ ref('dim_product') }}
{% endset %}

{{ audit_helper.compare_query_columns(
    a_query = old_query,
    b_query = new_query,
    primary_key = "product_id",
    columns = ["status", "amount", "some_other_column"], 
   summarize=False,
) }}

Detailed response

primary_key column_name perfect_match null_in_a null_in_b missing_from_a missing_from_b conflicting_values
1 status true false false false false false
1 amount false false false false false true
1 some_other_column false true true false false false
... ... ... ... ... ... ... ...

Who will this benefit?

This is useful for comparing source and target queries in detail in one go, without relations defined and without comparing all columns, or repeating many times for a single specific column.

Are you interested in contributing this feature?

Yes, I would consider contributing only if it is deemed useful and likely to be merged into the lib.

Great lib, and of course it's simple enough for me to do in my own codebase, but also happy to enrich this lib for others if there is a general desire.

@mbyrne00 mbyrne00 added enhancement New feature or request triage labels May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant