Extend ExportCfg to support full data catalogs #335

Open
ChristopherRabotin opened this issue Jul 10, 2024 · 0 comments
Labels
Interface: Rust; Kind: New feature (this is a proposed new feature)

High level description

Kedro has a data catalog concept, which is absolutely fantastic to use. In a way, ExportCfg already plays this role, but only for saving data, only in Parquet format, and only locally.

The purpose of this ticket is to extend it so that many files can be loaded and saved in a single catalog entry under the same timestamp. That makes it easy for the engineer to tell when each dataset was generated and which files belong to which run.
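To make the "same timestamp" idea concrete, here is a minimal sketch of one possible on-disk layout, where every file of a run lives under `<root>/<version>/`. The function names, the folder layout, and the timestamp format are all assumptions for illustration, not part of the proposal itself.

```rust
use std::path::{Path, PathBuf};

/// Builds the path of one file inside a versioned (timestamped) folder:
/// `<root>/<version>/<file_name>`. Hypothetical layout, for illustration.
fn versioned_path(root: &Path, version: &str, file_name: &str) -> PathBuf {
    root.join(version).join(file_name)
}

/// Picks the most recent version among folder names.
/// Assumes fixed-width UTC timestamps (e.g. `2024-07-10T12-00-00Z`), for
/// which lexicographic order matches chronological order.
fn latest_version<'a>(versions: &[&'a str]) -> Option<&'a str> {
    versions.iter().copied().max()
}

/// Example: the three outputs of one run share a single version folder.
/// The root, version, and file names are made up.
fn example_run_outputs() -> Vec<PathBuf> {
    let root = Path::new("data/od_run");
    let version = "2024-07-10T12-00-00Z";
    ["trajectory.parquet", "residuals.parquet", "covariance.csv"]
        .iter()
        .map(|name| versioned_path(root, version, name))
        .collect()
}
```

Because all files of a run share one folder name, reloading a run would only need the version string, and `latest_version` could select the newest run when none is given.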

Requirements

  • When configured to do so, it should version saved data by writing it to a timestamped folder (MVP; other versioning methodologies can be added later).
  • The ExportCfg should be renamed to something more relevant for both loading and storing.
  • It shall support local storage and the S3 protocol for now, nothing else.
  • It shall support credentials for S3, like in Kedro.
  • It shall be possible to load many different files from a given version, unlike Kedro's catalog.
  • It shall support reading any dataframe format that Rust's arrow crate supports (at a minimum Parquet and CSV).
  • It shall be a serializable structure, either as YAML or as Dhall.
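As a rough illustration of the multi-file and multi-format requirements above, the sketch below maps every file of one version to a format before anything is read. `FileFormat`, `detect_format`, and `plan_loads` are hypothetical names, and detecting the format from the file extension is an assumption; the actual reading would presumably be delegated to the arrow and parquet crates.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Formats named as the minimum in the requirements (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileFormat {
    Parquet,
    Csv,
}

/// Guesses the format from the file extension, ignoring case.
/// Returns `None` for a missing or unsupported extension.
fn detect_format(file_name: &str) -> Option<FileFormat> {
    let ext = Path::new(file_name).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "parquet" => Some(FileFormat::Parquet),
        "csv" => Some(FileFormat::Csv),
        _ => None,
    }
}

/// Maps every file of one version to its format, rejecting unsupported
/// files up front instead of failing halfway through a load.
fn plan_loads(file_names: &[&str]) -> Result<BTreeMap<String, FileFormat>, String> {
    let mut plan = BTreeMap::new();
    for &name in file_names {
        let format = detect_format(name)
            .ok_or_else(|| format!("unsupported file format: {}", name))?;
        plan.insert(name.to_string(), format);
    }
    Ok(plan)
}
```

Validating the whole list first means a catalog entry either loads completely or not at all, which seems to fit the goal of reloading full scenario data.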

Test plans

  • Replace all ExportCfg with this new approach
  • Ensure that full scenario data can be reloaded from there.

Design

This should also take inspiration from the MetaFile approach used in ANISE to download data behind URLs. I also wonder whether this should be its own crate!

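use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
use std::fmt::Debug;

/// Top-level catalog configuration, serializable as YAML or Dhall.
#[derive(Serialize, Deserialize, Debug)]
pub struct DataCatalogConfig {
    /// Whether data is saved in a versioned (timestamped) folder.
    pub versioning: bool,
    pub storage: StorageConfig,
    /// Only needed for S3 storage.
    pub credentials: Option<Credentials>,
    /// Catalog entries by name. Trait objects cannot derive the serde
    /// traits, so this field is skipped and is empty after deserialization.
    #[serde(skip)]
    pub files: BTreeMap<String, Option<Box<dyn LoadedFile>>>,
}

/// Where the data lives: a local path or an S3 path.
#[derive(Serialize, Deserialize, Debug)]
pub struct StorageConfig {
    pub local_path: Option<String>,
    pub s3_path: Option<String>,
}

/// S3 credentials, like in Kedro.
#[derive(Serialize, Deserialize, Debug)]
pub struct Credentials {
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
}

/// A file that can be loaded from the catalog.
/// `Debug` is a supertrait so that `DataCatalogConfig` can derive `Debug`.
pub trait LoadedFile: Debug {
    fn load(&self) -> Result<Box<dyn LoadedFile>, Box<dyn std::error::Error>>;
}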
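One open point in this design is that `StorageConfig` allows both paths, or neither, to be set. Below is a minimal std-only sketch of how a config could be resolved into exactly one backend. The `Backend` enum, the `resolve_backend` function, and the assumption that `s3_path` is an `s3://bucket/prefix` URI are all hypothetical; the struct mirrors the one above without the serde derives.

```rust
use std::path::PathBuf;

/// Std-only mirror of the proposed `StorageConfig` (serde derives omitted).
struct StorageConfig {
    local_path: Option<String>,
    s3_path: Option<String>,
}

/// Hypothetical resolved storage target: exactly one backend.
#[derive(Debug, PartialEq)]
enum Backend {
    Local(PathBuf),
    S3 { bucket: String, prefix: String },
}

/// Resolves the config into one backend, or explains why it cannot.
/// Assumes `s3_path` has the form `s3://bucket/optional/prefix`.
fn resolve_backend(cfg: &StorageConfig) -> Result<Backend, String> {
    match (&cfg.local_path, &cfg.s3_path) {
        (Some(local), None) => Ok(Backend::Local(PathBuf::from(local))),
        (None, Some(uri)) => {
            let rest = uri
                .strip_prefix("s3://")
                .ok_or_else(|| format!("expected an s3:// URI, got {}", uri))?;
            // Everything before the first '/' is the bucket, the rest is the prefix.
            let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
            if bucket.is_empty() {
                return Err(format!("missing bucket name in {}", uri));
            }
            Ok(Backend::S3 {
                bucket: bucket.to_string(),
                prefix: prefix.to_string(),
            })
        }
        (Some(_), Some(_)) => Err("both local_path and s3_path are set".to_string()),
        (None, None) => Err("neither local_path nor s3_path is set".to_string()),
    }
}
```

If the two optional fields were replaced by a single enum in the serialized structure, this validation step would not be needed, at the cost of a slightly less flat YAML or Dhall file.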