aws-glue-alpha: S3-Table table properties are added to the wrong parameters section #27365

Open
guyernest opened this issue Sep 30, 2023 · 4 comments
Labels
@aws-cdk/aws-glue Related to AWS Glue bug This issue is a bug. p2

guyernest commented Sep 30, 2023

Describe the bug

The TableInput property of AWS::Glue::Table has two separate Parameters maps: one at the table level (TableInput.Parameters) and one at the storage level (TableInput.StorageDescriptor.Parameters). The current S3-Table implementation writes every custom parameter to StorageDescriptor.Parameters and offers no way to add entries to the table-level map, whose contents are hard-coded.

The use case is dynamic partitioning, which uses projection.<dynamic-partitioning>.format and similar parameters to define how Glue (and Athena) parse the dynamic partitioning field. Data archived to S3 through Kinesis Firehose is commonly laid out this way.

Expected Behavior

When using the following code in the CDK:

    const replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>,
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // Parameters that drive the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'),
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
        glue.StorageParameter.custom('projection.datehour.interval', '1'),
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
        // Single quotes on purpose: ${datehour} must reach Glue literally
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

I expect to get the following CFN snippet:

  "ReplicationTable2E30ABDE": {
   "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "partition_filtering.enabled": true,
      "projection.enabled": "true",
      "projection.datehour.type": "date",
      "projection.datehour.format": "yyyy/MM/dd",
      "projection.datehour.range": "2021/01/01,NOW",
      "projection.datehour.interval": "1",
      "projection.datehour.interval.unit": "DAYS",
      "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
      "EXTERNAL": "TRUE",
      "compressionType": "gzip"
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [<Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }

Please note that the dynamic partitioning parameters are under TableInput.Parameters.

Current Behavior

Instead, I get the following snippet in the synthesized stack:

  "Type": "AWS::Glue::Table",
   "Properties": {
    "CatalogId": {
     "Ref": "AWS::AccountId"
    },
    "DatabaseName": {
     "Ref": "DatabaseB269D8BB"
    },
    "TableInput": {
     "Name": <Glue-Table-Name>,
     "Parameters": {
      "classification": "json",
      "has_encrypted_data": true,
      "partition_filtering.enabled": true
     },
     "PartitionKeys": [
      {
       "Name": "datehour",
       "Type": "string"
      }
     ],
     "StorageDescriptor": {
      "Columns": [ <Columns>],
      "Compressed": true,
      "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
      "Location": {
       "Fn::Join": [
        "",
        [
         "s3://",
         {
          "Ref": "EventsBucketCD4657F9"
         },
         "/events/table=<DynamoDB-Table-Name>/"
        ]
       ]
      },
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Parameters": {
       "compression_type": "gzip",
       "projection.enabled": "true",
       "projection.datehour.type": "date",
       "projection.datehour.format": "yyyy/MM/dd",
       "projection.datehour.range": "2021/01/01,NOW",
       "projection.datehour.interval": "1",
       "projection.datehour.interval.unit": "DAYS",
       "storage.location.template": "s3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/",
       "EXTERNAL": "TRUE",
       "compressionType": "gzip"
      },
      "SerdeInfo": {
       "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
      },
      "StoredAsSubDirectories": true
     },
     "TableType": "EXTERNAL_TABLE"
    }
   }

Please note that the dynamic partitioning parameters are added to the wrong parameters section: StorageDescriptor.Parameters instead of TableInput.Parameters.
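
To make the difference between the two snippets concrete, here is a minimal sketch in plain TypeScript (no CDK dependency) of the relocation that would turn the current output into the expected one. The names `TableInputLike`, `isTableLevelKey` and `moveTableLevelParameters` are hypothetical, and the list of table-level keys is an assumption read off the Expected Behavior snippet, not taken from the aws-glue-alpha source.

```typescript
// Hypothetical shape covering only the fields relevant to this issue.
type Params = Record<string, string | boolean>;

interface TableInputLike {
  Parameters?: Params;
  StorageDescriptor?: { Parameters?: Params };
}

// Assumption: these are the keys that belong in TableInput.Parameters.
// The list mirrors the Expected Behavior snippet above.
function isTableLevelKey(key: string): boolean {
  return (
    /^projection\./.test(key) ||
    key === 'storage.location.template' ||
    key === 'EXTERNAL' ||
    key === 'compressionType'
  );
}

// Returns a copy of the input in which table-level keys have been moved
// from StorageDescriptor.Parameters to TableInput.Parameters.
function moveTableLevelParameters(input: TableInputLike): TableInputLike {
  const source: Params = input.StorageDescriptor?.Parameters ?? {};
  const tableParams: Params = { ...(input.Parameters ?? {}) };
  const storageParams: Params = {};

  for (const key of Object.keys(source)) {
    if (isTableLevelKey(key)) {
      tableParams[key] = source[key];
    } else {
      storageParams[key] = source[key];
    }
  }

  return {
    ...input,
    Parameters: tableParams,
    StorageDescriptor: { ...input.StorageDescriptor, Parameters: storageParams },
  };
}
```

Applied to the TableInput from Current Behavior, this would leave only compression_type in StorageDescriptor.Parameters, which matches the Expected Behavior snippet.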

Reproduction Steps

Use code similar to the following in your stack definition under /lib:

    const replication_table = new glue.S3Table(this, 'ReplicationTable', {
      database: replication_database,
      tableName: <Glue-Table-Name>,
      columns: <Columns>,
      partitionKeys: [{
        name: 'datehour',
        type: glue.Schema.STRING,
      }],
      bucket: eventsBucket,
      s3Prefix: 'events/table=<DynamoDB-Table-Name>/',
      storedAsSubDirectories: true,
      storageParameters: [
        glue.StorageParameter.compressionType(glue.CompressionType.GZIP),
        // Parameters that drive the dynamic partitioning
        glue.StorageParameter.custom('projection.enabled', 'true'),
        glue.StorageParameter.custom('projection.datehour.type', 'date'),
        glue.StorageParameter.custom('projection.datehour.format', 'yyyy/MM/dd'),
        glue.StorageParameter.custom('projection.datehour.range', '2021/01/01,NOW'),
        glue.StorageParameter.custom('projection.datehour.interval', '1'),
        glue.StorageParameter.custom('projection.datehour.interval.unit', 'DAYS'),
        // Single quotes on purpose: ${datehour} must reach Glue literally
        glue.StorageParameter.custom('storage.location.template', 's3://cdk-event-log-multi-definition-use-case-poc-events-bucket/events/table=cdk-event-log-multi-definition-use-case-poc-table/${datehour}/'),
        glue.StorageParameter.custom('EXTERNAL', 'TRUE'),
        glue.StorageParameter.custom('compressionType', 'gzip'),
      ],
      dataFormat: glue.DataFormat.JSON,
      enablePartitionFiltering: true,
      compressed: true,
    });

Possible Solution

I can think of three options to solve the bug:

  • Allow adding parameters to the tableInput and not only to the storageParameters, for example through a new tableParameters property.
  • Allow access to the node after the constructor, so that the user can move the entries within the tableInput object.
  • Add a method specific to the dynamic-partitioning option, similar to the way Kinesis Firehose defines it in a single value:
        extendedS3DestinationConfiguration : {
          prefix: 'events/table=!{partitionKeyFromQuery:tablename}/!{timestamp:yyyy/MM/dd}/',
          errorOutputPrefix: 'errors/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/',
...
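
As a rough illustration of the third option, the sketch below derives the table-level projection parameters for one date-typed partition key from a few options. `DateProjectionOptions` and `dateProjectionParameters` are hypothetical names that do not exist in aws-glue-alpha; the parameter keys are the ones used in the snippets above.

```typescript
// Hypothetical options for one date-typed projected partition key.
interface DateProjectionOptions {
  column: string;            // partition key name, e.g. 'datehour'
  format: string;            // date format, e.g. 'yyyy/MM/dd'
  range: [string, string];   // e.g. ['2021/01/01', 'NOW']
  interval?: number;         // e.g. 1
  intervalUnit?: string;     // e.g. 'DAYS'
  locationTemplate?: string; // e.g. 's3://<bucket>/<prefix>/${datehour}/'
}

// Builds the entries intended for TableInput.Parameters.
function dateProjectionParameters(opts: DateProjectionOptions): Record<string, string> {
  const prefix = `projection.${opts.column}`;
  const params: Record<string, string> = {
    'projection.enabled': 'true',
    [`${prefix}.type`]: 'date',
    [`${prefix}.format`]: opts.format,
    [`${prefix}.range`]: opts.range.join(','),
  };
  if (opts.interval !== undefined) {
    params[`${prefix}.interval`] = String(opts.interval);
  }
  if (opts.intervalUnit !== undefined) {
    params[`${prefix}.interval.unit`] = opts.intervalUnit;
  }
  if (opts.locationTemplate !== undefined) {
    params['storage.location.template'] = opts.locationTemplate;
  }
  return params;
}

// Usage with the values from this issue. The template is single-quoted so
// that ${datehour} is kept literally instead of being interpolated.
const exampleParams = dateProjectionParameters({
  column: 'datehour',
  format: 'yyyy/MM/dd',
  range: ['2021/01/01', 'NOW'],
  interval: 1,
  intervalUnit: 'DAYS',
  locationTemplate: 's3://<Replication-Bucket>/events/table=<DynamoDB-Table-Name>/${datehour}/',
});
console.log(exampleParams);
```

A focused method like this would keep the key names out of user code, but it still needs a way to write the result into TableInput.Parameters, so it does not replace the first option.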

Additional Information/Context

As mentioned above, this is part of a common pipeline that replicates a DynamoDB table to S3 to allow analytical queries on that data from Athena. In the example above (extendedS3DestinationConfiguration), the user can define the format of the dynamic partitioning of the data in Firehose. If we fix this issue with a similarly focused method (option 3 above), it will be easy to extend constructs such as KinesisStreamsToKinesisFirehoseToS3, AwsDynamoDBKinesisStreamsS3 or KinesisFirehoseToS3 to support the creation of the Glue table on top of the data in S3.

CDK CLI Version

2.99.0 (build 0aa1096)

Framework Version

No response

Node.js Version

v16.18.1

OS

macOS

Language

TypeScript

Language Version

No response

Other information

No response

@guyernest guyernest added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 30, 2023
@github-actions github-actions bot added the @aws-cdk/aws-glue Related to AWS Glue label Sep 30, 2023
@indrora indrora added p2 and removed needs-triage This issue or PR still needs to be triaged. labels Oct 3, 2023
indrora (Contributor) commented Oct 3, 2023

Can you point to the Glue docs (or CloudFormation docs for the Glue CFN) where these are described?

@indrora indrora added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 3, 2023
guyernest (Author) commented

Thank you @indrora for your attention.

If you check the TableInput in Glue CFN, you can see that it has Parameters and StorageDescriptor.
The StorageDescriptor CFN also has a Parameters section.

This is the source of the confusion, as some of the parameters should go to the TableInput Parameters section and some to the StorageDescriptor Parameters section.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 4, 2023
guyernest (Author) commented

Here is another link to the specific parameters that are needed for Athena: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html

As described under Possible Solution above, we can either add a general option for adding parameters to the TableInput in Glue, or add a more specific option for the projection parameters that Athena defines.

fastrockstar commented

I ran into this when I tried to add the skip.header.line.count property to the table using

storage_parameters=[glue.StorageParameter.skip_header_line_count(1)]

As you showed, it was written into the wrong parameters section.

After copying it to the correct place in the template and deploying it manually, the table property was correctly configured as expected.

Thank you for fixing it.
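
Until the construct supports table-level parameters, one possible way to avoid editing the template by hand is a CloudFormation property override on the underlying L1 resource. The sketch below only builds the override path; `tableParameterPath` is a hypothetical helper, and the usage in the trailing comment assumes that `addPropertyOverride` from aws-cdk-lib treats a backslash-escaped dot as part of the key rather than as a path separator.

```typescript
// Builds the property path for one key in TableInput.Parameters.
// Dots inside the key are escaped so they are not read as path separators.
function tableParameterPath(key: string): string {
  return 'TableInput.Parameters.' + key.replace(/\./g, '\\.');
}

console.log(tableParameterPath('skip.header.line.count'));

// Assumed usage inside a CDK stack (not executed here; `table` stands for
// the glue.S3Table instance and CfnTable for the L1 class in aws-cdk-lib/aws-glue):
//   const cfnTable = table.node.defaultChild as CfnTable;
//   cfnTable.addPropertyOverride(tableParameterPath('skip.header.line.count'), '1');
```

The override only adds the key to TableInput.Parameters. A copy passed through storageParameters would still appear in StorageDescriptor.Parameters.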
