Beam Enrich has been moved to the snowplow/enrich repo, where it is still actively maintained.

Beam Enrich

Introduction

Beam Enrich reads raw Snowplow events from an input GCP PubSub subscription, enriches them and writes them to an output PubSub topic. Events are enriched using the scala-common-enrich library.
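The flow can be pictured as a single streaming pipeline. The following is a conceptual sketch only, not the actual Beam Enrich source: it assumes Scio's pubsubSubscription and saveAsPubsub helpers, and the enrich function is a stand-in for the work done by scala-common-enrich (including validation and bad-row routing):

import com.spotify.scio._

object EnrichFlowSketch {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.pubsubSubscription[String](args("raw")) // read raw events from the input subscription
      .map(enrich)                             // validate and enrich each event
      .saveAsPubsub(args("enriched"))          // publish enriched events to the output topic
    sc.run()
  }

  // Placeholder: the real enrichment logic lives in scala-common-enrich.
  private def enrich(raw: String): String = raw
}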

Building

This project uses sbt-native-packager.

Zip archive

To build the zip archive, run:

sbt universal:packageBin
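With sbt-native-packager's default settings, the archive is written to target/universal/. Assuming version 1.2.0 (the tag used in the Docker examples below):

unzip target/universal/beam-enrich-1.2.0.zip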

Docker image

To build a Docker image, run:

sbt docker:publishLocal
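The image is published to the local Docker daemon and should then be visible with:

docker images snowplow/beam-enrich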

Running

Directly

Once unzipped, the artifact can be run as follows:

./bin/beam-enrich \
  --runner=DataflowRunner \
  --project=project-id \
  --streaming=true \
  --zone=europe-west2-a \
  --gcpTempLocation=gs://location/ \
  --raw=projects/project/subscriptions/raw-topic-subscription \
  --enriched=projects/project/topics/enriched-topic \
  --bad=projects/project/topics/bad-topic \
  --pii=projects/project/topics/pii-topic \
  --resolver=iglu_resolver.json \
  --labels='{"label": "value"}' \
  --enrichments=enrichments/

The --pii and --labels flags are optional; the labels JSON is quoted so that the shell passes it through as a single argument.
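The --resolver flag expects a standard Iglu resolver configuration. A minimal sketch pointing only at Iglu Central (the cacheSize value is illustrative):

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": { "uri": "http://iglucentral.com" }
        }
      }
    ]
  }
}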

To display the help message:

./bin/beam-enrich \
  --runner=DataflowRunner \
  --help

Through a Docker container

A container can be run as follows:

docker run \
  -v $PWD/config:/snowplow/config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/snowplow/config/credentials.json \
  snowplow/beam-enrich:1.2.0 \
  --runner=DataflowRunner \
  --job-name=snowplow-enrich \
  --project=project-id \
  --streaming=true \
  --zone=europe-west2-a \
  --gcpTempLocation=gs://location/ \
  --raw=projects/project/subscriptions/raw-topic-subscription \
  --enriched=projects/project/topics/enriched-topic \
  --bad=projects/project/topics/bad-topic \
  --pii=projects/project/topics/pii-topic \
  --resolver=/snowplow/config/iglu_resolver.json \
  --labels='{"label": "value"}' \
  --enrichments=/snowplow/config/enrichments/

As above, --pii and --labels are optional. The GOOGLE_APPLICATION_CREDENTIALS variable is only needed when running outside GCP.

To display the help message:

docker run snowplow/beam-enrich:1.2.0 \
  --runner=DataflowRunner \
  --help

Additional information

Note that enrichments relying on local files require those files to be hosted on Google Cloud Storage, e.g. for the IP lookups enrichment:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLite2-City.mmdb",
        "uri": "gs://beam-enrich-test/maxmind"
      }
    }
  }
}
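For example, the MaxMind database can be staged with gsutil, using the bucket and path from the snippet above:

gsutil cp GeoLite2-City.mmdb gs://beam-enrich-test/maxmind/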

A full list of the supported pipeline options can be found at: https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options.

Testing

To run the tests:

sbt test

REPL

To experiment with the current codebase in the Scio REPL, simply run:

sbt repl/run

Find out more

Technical Docs
Setup Guide

Copyright and license

Copyright 2018-2020 Snowplow Analytics Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
