Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PII pseudonymizer enrichment + unit tests #336

Merged
merged 6 commits into from
Sep 23, 2020
Merged

Conversation

benjben
Copy link
Contributor

@benjben benjben commented Sep 8, 2020

The test failing is due to http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLite2-City.mmdb being removed/empty

This will be fixed in another commit.

@benjben benjben changed the title Fix PII enrichment + unit tests Fix PII pseudonymizer enrichment + unit tests Sep 8, 2020
Copy link
Contributor

@chuwy chuwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@benjben benjben marked this pull request as draft September 9, 2020 13:18
@benjben benjben marked this pull request as ready for review September 9, 2020 17:11
@benjben benjben requested a review from chuwy September 11, 2020 08:18
@dilyand
Copy link
Contributor

dilyand commented Sep 15, 2020

@dilyand don't we want to leave an empty string empty in the hashed output ?

@benjben I'm not sure. But:

  • If we leave an empty string empty but hash all other strings, we are making a decision that an empty string is not PII and so does not need to be pseudonymized.
  • Depending on context, we might make that call in other, more complex situations. For example, if the field is email_address but the value does not match a 'valid email' regex, then it's not PII and we can leave it unhashed.
  • Fundamentally, both of the above are the same type of sanitisation, the only difference is the complexity of the 'regex' that needs to be matched.
  • Since we're not doing 2, we shouldn't be doing 1.
  • Maybe in some situations knowing that a field is an empty string will reveal some PII? (Though this point is weakened by the fact that all empty strings will have the same hash, so they will be easy to recognise.)
  • The case with null value is different. There we return no value and not a hash, because there is nothing to hash, not because we decide it's not sensitive.

I'm happy to consider counterpoints, but with the above in mind I'm leaning towards hashing the empty string.

WDYT?

@benjben
Copy link
Contributor Author

benjben commented Sep 15, 2020

If we leave an empty string empty but hash all other strings, we are making a decision that an empty string is not PII

IMHO that's the case. And hashing the empty strings will always produce the same hash.

if the field is email_address but the value does not match a 'valid email' regex, then it's not PII and we can leave it unhashed.

IMHO hashing of PII is not related to data validation and we should hash whatever there is to hash.

@dilyand
Copy link
Contributor

dilyand commented Sep 15, 2020

In that case, the test reveals a bug...

I think I agree with:

hashing of PII is not related to data validation and we should hash whatever there is to hash

but we differ in that I think this includes all values, including empty string.

Btw, do you want to keep this test? I can rebase to merge my commit with your first one.

@benjben
Copy link
Contributor Author

benjben commented Sep 15, 2020

In that case, the test reveals a bug...

But that doesn't reproduce the error "" => object 🤔

do you want to keep this test?

Sure !! The more testing the better !

@chuwy
Copy link
Contributor

chuwy commented Sep 15, 2020

@dilyand, @benjben let me know when its ready for review again - I'd like to start putting 1.4.0 together.

@chuwy
Copy link
Contributor

chuwy commented Sep 16, 2020

Hey @benjben! Are you ok to fix the #351 in this branch and potentially incorporate Dilyan's commit.

@chuwy chuwy removed their request for review September 16, 2020 17:18
@benjben
Copy link
Contributor Author

benjben commented Sep 16, 2020

Sure @chuwy ! So just to be sure that we don't duplicate work, this is what you first wanted to take care of but finally I do right ?

@chuwy
Copy link
Contributor

chuwy commented Sep 16, 2020

@benjben nope, or at least I don't remember taking care of it. I implemented the recovery for a client.

@chuwy
Copy link
Contributor

chuwy commented Sep 16, 2020

Oh, actually you're right - I wanted to do this! Sorry. Yeah, let's decide tomorrow - I'm just signing off today and thought you'll be working on something. Ignore me.

Copy link
Contributor

@chuwy chuwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand evertying that is going on here, but happy that we have tests.

My implementation of removeAddedFields is here just for a reference, feel free to add or ignore, although it's a bit shorter.

Copy link
Member

@oguzhanunlu oguzhanunlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@benjben benjben merged commit 1e937a1 into develop Sep 23, 2020
@benjben benjben deleted the feature/unit_tests branch September 23, 2020 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants