Stream Enrich: New Hope #328

Merged
chuwy merged 4 commits into develop from feature/fs2-enrich
Oct 21, 2020

Conversation

@chuwy commented Sep 1, 2020

@chuwy changed the base branch from master to feature/cats2 September 1, 2020 11:47
@chuwy changed the title from "Stream Enrich NG" to "Stream Enrich: New Hope" Sep 14, 2020
@chuwy force-pushed the feature/cats2 branch 2 times, most recently from 8ea132b to ada7cd5 September 15, 2020 22:28
@chuwy force-pushed the feature/fs2-enrich branch 3 times, most recently from 352044f to 15da7b7 September 15, 2020 22:56
}
.map(enriched => Payload(enriched, row.ack))

result.handleErrorWith(sendToSentry[F](row, sentry))
Contributor

Shouldn't we send the payload rather than the row? That way we wouldn't need to Thrift-deserialize it when troubleshooting.

Contributor Author

I think sendToSentry is probably slightly misleading. It does three things:

  1. Sends an exception to Sentry (we cannot send anything from an event because it can contain PII data)
  2. Creates a generic_error bad row - that's why we need a row
  3. Logs an error
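
The three responsibilities listed above can be sketched roughly as follows. This is a simplified, standard-library-only illustration and not the code from this PR: `GenericError`, `FailureHandling`, `reportException` and `handleFailure` are hypothetical names, the in-memory buffers stand in for Sentry and a logger, and the real handler presumably runs inside an effect type `F`.

```scala
import java.util.Base64
import scala.collection.mutable.ListBuffer

// Hypothetical stand-in for a generic_error bad row: it carries the raw
// collector bytes (base64-encoded) so the original row stays recoverable.
final case class GenericError(message: String, payloadBase64: String)

object FailureHandling {
  // In-memory sinks standing in for Sentry and a logger (assumptions of this sketch)
  val reported: ListBuffer[String] = ListBuffer.empty
  val logged: ListBuffer[String] = ListBuffer.empty

  // 1. Report only the exception type, never event data, since events may contain PII
  def reportException(error: Throwable): Unit = {
    reported += error.getClass.getSimpleName
    ()
  }

  // Handles an unexpected failure for a raw row and returns the bad row to emit
  def handleFailure(row: Array[Byte])(error: Throwable): GenericError = {
    reportException(error)
    // 2. Build the bad row from the raw bytes, which is why the row is needed
    val badRow = GenericError(
      message = Option(error.getMessage).getOrElse("unknown error"),
      payloadBase64 = Base64.getEncoder.encodeToString(row)
    )
    // 3. Log the error
    logged += s"Runtime exception during enrichment: ${badRow.message}"
    badRow
  }
}
```

Keeping all three steps behind one function means every unexpected failure produces the same kind of bad row, regardless of where in the pipeline it happened.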

Contributor

My point is, why not use payload instead of row in the generic bad row created? An array of Thrift bytes is not very useful to troubleshoot, compared with a BadRow (CPFormatViolation) or a CollectorPayload

Contributor Author

Ah, ok makes sense now!

The reason was that our payload is something like ValidatedNel[CPFormatViolation, Option[CollectorPayload]], so we don't really have a parsed payload yet. We technically can pattern-match on it, like:

payload match {
  case Validated.Invalid(errors) => // what to do here? why were we trying to process an error in the first place?
    errors // thrift bytes anyway
  case Validated.Valid(payload) =>
    turnIntoAdapterFailure(payload) // but what if it's not an adapter failure? an enrichment failure? then we need a raw event to construct the bad row
}

It also feels weird to produce different kinds of bad rows from the same place, so I decided to stick with the most generic one.

Contributor

I was thinking of putting something like show"$payload" in the generic_error, whatever it contains.

Contributor Author

But in that case we won't have a clear way to recover it. I think it's an important promise that whenever you have a generic_error coming from enrich, you can base64-decode the payload in order to recover it.
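
To make the recovery promise concrete, here is a minimal sketch of the round trip, assuming the bad row carries the raw collector bytes as a base64 string. `Recovery`, `encodePayload` and `recoverPayload` are hypothetical names used only for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

object Recovery {
  // Encode the raw collector bytes the way a recoverable bad row would carry them
  def encodePayload(raw: Array[Byte]): String =
    Base64.getEncoder.encodeToString(raw)

  // Recover the original bytes; Left if the payload is not valid base64
  def recoverPayload(encoded: String): Either[String, Array[Byte]] =
    try Right(Base64.getDecoder.decode(encoded))
    catch {
      case e: IllegalArgumentException =>
        Left(s"Not valid base64: ${e.getMessage}")
    }
}
```

A payload rendered with `show` has no such guaranteed inverse, which is the concern raised above: a recovery job could not reliably turn it back into the original bytes.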

Contributor

It's actually a bug that ThriftLoader.toCollectorPayload's signature says it can return multiple bad rows; it can only return one.

So in the generic_error bad row we could put either the single CPFormatViolation (recoverable) or the raw extracted CollectorPayload (recoverable).

@chuwy force-pushed the feature/cats2 branch 3 times, most recently from 2cbe8b4 to e3dfc3e September 17, 2020 12:49
@chuwy changed the base branch from feature/cats2 to develop September 17, 2020 14:04
@chuwy force-pushed the feature/fs2-enrich branch 4 times, most recently from 2b8c98f to 3f425e3 September 20, 2020 14:31
@chuwy force-pushed the feature/fs2-enrich branch 4 times, most recently from 8df2f4b to e80962a October 6, 2020 10:42
@chuwy requested a review from a team October 6, 2020 10:43
@chuwy force-pushed the feature/fs2-enrich branch 2 times, most recently from 8b0d3a3 to cc5d6b1 October 6, 2020 16:12
Contributor Author

chuwy commented Oct 11, 2020

Hey @benjben! I addressed all your feedback, added a few more tests and a couple of tickets (#370 - depends on NH, #371 was also discovered while I was writing tests). If anyone else from @snowplow/com-snowplowanalytics-engineering-datacapability wants to have a look, you're welcome. Otherwise this should be ready.

@chuwy force-pushed the feature/fs2-enrich branch 6 times, most recently from 9851e1e to 6e25f3b October 16, 2020 16:27
@chuwy force-pushed the feature/fs2-enrich branch 6 times, most recently from f0acad3 to a874162 October 17, 2020 08:40
@@ -0,0 +1,23 @@
auth = {
type = "Gcp"
Member

  • Acronyms are generally all-caps. Is there a specific reason to use Gcp?
  • From a user's perspective, it'd be useful if we could see valid values of configuration fields, e.g. is gcp or GCP valid here? Or do we want to rely on user-friendly error messages explaining what's wrong and how to fix?

Member

  • It'd be nice to see which fields are optional and which values are used by default when applicable, e.g. assetsUpdatePeriod if it is not configured

Contributor Author

The spelling was dictated by codec derivation (it uses the exact case class name). I decided not to change it for now as Gcp is the only valid value here, but I agree this is something that should be fixed.

I'll add comments to the config file.
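
One possible direction for the fix discussed in this thread, sketched with the standard library only: keep the case object name as the canonical value, accept any casing, and list the valid values in the error message. `Authentication`, `all` and `parse` are hypothetical names; the PR itself relies on derived codecs, which this hand-written lookup does not reproduce.

```scala
sealed trait Authentication extends Product with Serializable

object Authentication {
  case object Gcp extends Authentication

  // All supported values, listed once so error messages can enumerate them
  val all: List[Authentication] = List(Gcp)

  // Case-insensitive lookup by the case object's name ("Gcp", "gcp" and "GCP" all match)
  def parse(raw: String): Either[String, Authentication] =
    all
      .find(_.productPrefix.equalsIgnoreCase(raw))
      .toRight(
        s"Unknown auth type '$raw'; valid values: ${all.map(_.productPrefix).mkString(", ")}"
      )
}
```

With this shape, a user who writes an unsupported value gets a message that names every accepted value instead of a bare decoding failure.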


object State {

/** Test pair is used in tests to initialize HTTP client */
Member

Why do we have it here if it is used in tests?

Contributor Author

Member

Looking at L92, it seems it isn't used only in tests. Could you update this scaladoc?

Member

I didn't notice your comment and posted the last one. Still, the scaladoc needs an update, I think.

Contributor

Why does production code need to know anything about test code?

Contributor

@lukeindykiewicz left a comment

Looks good in general! It's a bit too big a PR to read very carefully.

  if: ${{ always() }}
  run: sbt coveralls
  env:
    COVERALLS_REPO_TOKEN: ${{ secrets.COVERALLS_REPO_TOKEN }}
- name: Check Scala formatting
  if: ${{ always() }}
  run: sbt scalafmtCheck
Contributor

Could you change to scalafmtCheckAll and add scalafmtSbtCheck, please?


final case class Hash private (s: String) extends AnyVal

object Hash {
private[this] def fromBytes(bytes: Array[Byte]): Hash = {
Contributor

This small function should have a test

// side-effecting get-set is inherently not thread-safe
// we need to be sure the state.stop is set to true
// before re-initializing enrichments
_ <- Logger[F].info(s"Unpausing enrich stream")
Contributor

Unpausing -> Resuming

Contributor

Would be good to stick to one, either show or s.

@chuwy force-pushed the feature/fs2-enrich branch 2 times, most recently from bfc2b5f to c61fa01 October 21, 2020 11:27
@chuwy merged commit c61fa01 into develop Oct 21, 2020
@chuwy deleted the feature/fs2-enrich branch October 21, 2020 14:54
@chuwy mentioned this pull request Oct 21, 2020
4 participants