Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PerTableConfig's extensibility and make it format agnostic #297

Open
ashvina opened this issue Dec 19, 2023 · 17 comments
Open

Improve PerTableConfig's extensibility and make it format agnostic #297

ashvina opened this issue Dec 19, 2023 · 17 comments

Comments

@ashvina
Copy link
Contributor

ashvina commented Dec 19, 2023

This enhancement request stems from 1) the conversation in #293 and 2) emerging needs such as #296. The intent is to clean up the PerTableConfig.

OneTable's sync flow is based on PerTableConfig, which is the input configuration provided by the user. The sync process translates metadata of a single source table into one or more target tables. The user must provide this config for the translation process to be successful.

This image below illustrates the current structure of PerTableConfig.
image

However, the use cases have changed over time and now require more flexibility and compatibility with different table formats. A different location may be required for generating the metadata of the target table. In that case, the path to that location should also be provided. Additionally, the target table may have a connection to another catalog instance. Which means that the target table requires not just a format identifier, but also some of the configurations that are currently provided for a source table only.

The current configuration object includes some configurations that are specific to Iceberg and Hudi formats. These configurations should be wrapped by input configuration instances that are specific to each format.

The following image shows the proposed PerTableConfig.
image

  1. A better way to name a PerTableConfig is a TableSyncConfig, because it is a configuration for synchronizing a table.
  2. Separate the configurations for the sync task, common table configs, and configs specific to formats.
  3. Instead of using only a format identifier, represent target table formats as a table. Create a separate entity called ExternalTable that can be either a source table or a target table. The ExternalTable clearly differentiates between internal representation and external table.

A possible class structure for representing the table config is the topic I want to discuss.

@ashvina
Copy link
Contributor Author

ashvina commented Dec 19, 2023

cc: @the-other-tim-brown @lmccay

@lmccay
Copy link
Contributor

lmccay commented Dec 19, 2023

I like the idea of cleaning this up. I've been struggling what the "perTable" actually means in the name of this. It doesn't really have a per table cardinality within it and there isn't one provided per table in the sync.

Why is this not just considered a sync or job context?
Maybe we could even hide the optional need for Hadoop Configuration within that context instead of being a separate config that has to be passed around.

I have questions about the class diagram:

  1. It seems to represent actual TargetTable's rather than the intended formats of those tables which are what is in the config today and the actual tables are realized at sync time rather than in the config. I guess it is metadata representation of a target table given the other classes there.
  2. What are the implementation classes SourceTable and TargetTable are they what are clients now and they consume their respective config objects? Should SourceTables be able to access TargetTable' config?
  3. I guess each table implementation would have access to all of the syncConfig since it is provided either at class creation time via ctor or via the init method that we may be adding?
  4. Do SourceTables really not have anything different to be configured from TargetTables? If not, do we really need both as external table concepts? They are all Tables and one is just designated as a Source for this job. Although, there are separate SourceClients so I wonder how we can be sure they don't/won't need different config.

@lmccay
Copy link
Contributor

lmccay commented Dec 19, 2023

I'm picturing a flow where there is a SyncSubmit class that gets handed a config like the above which comes from a YAML file or programmatically assembled then uses the factory to instantiate the various clients and inject the client specific config before init(). Limits the scope of what each class can see.

Each targetClient gets a TargetClientConfig
Each sourceClient gets a SourceClientConfig

@ashvina
Copy link
Contributor Author

ashvina commented Dec 19, 2023

It doesn't really have a per table cardinality within it.

This is unclear to me. A single source table implies a perTable cardinality. The goal is to convert the commit history of one source table into an intermediate format, and then create one or more versions of different target tables.

Why is this not just considered a sync or job context?

Agree, I too see it as a job config. Hence my proposal was to call it a PerTableSyncConfig.

It seems to represent actual TargetTable's rather than the intended formats...I guess it is metadata representation of a target table given the other classes there.

A target table identifier alone is not enough, as #296 shows. For example, the input should specify where to create the target table. In other cases, the target table format may need catalog configuration. Hence the proposal to abstract it like a table.

SourceTable and TargetTable ... are they what are clients now

Source and target tables are representations of table configs provided by user in per table config. They will be used for instantiating Delta, Hudi, or Iceberg clients. I guess a better name should be searched for this.

Do SourceTables really not have anything different to be configured from TargetTables? If not, do we really need both as external table concepts?

They are different. For simplicity I did not put details there. Configs like target metadata retention belong to Target tables only.

@lmccay
Copy link
Contributor

lmccay commented Dec 19, 2023

@ashvina - perhaps the injection should actually be done at the property level. Each client provides a setter for each param that it requires and the SyncSubmit code would pull all the poperties from the relevant config and inject them as needed. Leaves the syncconfig object out of the clients altogether.

Thoughts?

@ashvina
Copy link
Contributor Author

ashvina commented Dec 19, 2023

I agree with you. The sync task's config model should not depend on how it is configured.

By the way, about the terminology, in OneTable, the implementations that read/write (source and targets) the table format's metadata are called clients. The code that invokes the sync tool are called utils, e.g. RunSync. The utils are mostly unaware of the client implementation at the moment.

@lmccay
Copy link
Contributor

lmccay commented Jan 6, 2024

@ashvina and @the-other-tim-brown - I've just been playing around with rolling our own injection mechanism to address PerTableConfig. I think that this will greatly simplify what we need to do. By doing this, we eliminate the need for the PerTableConfig interface in the api module and therefore the implementation specifics as well. By following the below steps within the Factory class, all of the required common and implementation specific properties are set in between the ServiceLoader creating the uninitialized client and the call to init(). The knowledge of all those properties remain in the core module and the clients and the clients can further process those within init() if needed.

  1. Use reflection to determine all of the getters in the PerTableConfig interface (this interface moves back to core).
  2. Convert the getter method names into setters for the targetClient methods that may or may not exist on each client.
  3. Get the value from the PerTableConfigImpl object and set it on each client where a setter exists.

This allows for extensibility of the getters and setters as new clients are added and new config is required. It allows for the client specific interfaces to remain unknown by the factory code with reflection based casting being used for the implementation specific setters (this was a tricky part). It also may mean that we don't have to change PerTableConfig much, if at all. We can let it represent the configured sync job and not worry about separating it into specific format configs or the like.

Here is some poc code that represents the injection stuff that I describe above. It will need to be productized and cleaned up but shows what I am talking about.

private void injectPerTableConfigAsNeeded(PerTableConfig perTableConfig, TargetClient targetClient) {
try {
  Class cls = Class.forName("io.onetable.client.PerTableConfig");

  Method[] methods = cls.getMethods();
  for (int i = 0; i < methods.length; i++) {
    Method method = methods[i];

    String targetMethodName = method.getName().replaceFirst("get", "set");
    System.out.println("-----");
    try {
      Statement stmt = new Statement(targetClient.getClass().cast(targetClient),
              targetMethodName,
              new Object[]{getConfigValue(perTableConfig, method)});
      stmt.execute();
    } catch (NoSuchMethodException nsme) {
      System.err.println(nsme);
    }
  }
} catch (Throwable e) {
  System.err.println(e);
}
}

private Object getConfigValue(PerTableConfig perTableConfig, Method method) {
try {
  Class cls = perTableConfig.getClass();

  String targetMethodName = method.getName().replaceFirst("get", "set");
  System.out.println("-----");
  Expression expression = new Expression(perTableConfig, method.getName(), new Object[0]);
  expression.execute();
  return expression.getValue();
} catch (Throwable e) {
  System.err.println(e);
}
return null;
}

The above would be called from the existing factory method as such - notice that PerTableConfig no longer needs to be in the TargetClient interface:

public TargetClient createForFormat(
  String tableFormat, PerTableConfig perTableConfig, Configuration configuration) {
TargetClient targetClient = createTargetClientForName(tableFormat);

injectPerTableConfigAsNeeded(perTableConfig, targetClient);

targetClient.init(configuration);
return targetClient;
}

I'm wondering whether we should investigate the injection before the refactoring of the PerTableConfig to see what is still needed after. What do you think?

@the-other-tim-brown
Copy link
Contributor

I will need some time to think through the sample you provided. My gut is saying to try to tackle the PerTableConfig and cleanup of some of these format specific things leaking into the common interfaces first to better understand how we want to do injection.

@lmccay
Copy link
Contributor

lmccay commented Jan 9, 2024

@the-other-tim-brown - sure. Check out the PR #307 as a POC implementation.
It works nicely and removes the need for all those other interfaces in the api module.

I think if the injection filters out the format specifics then the config representing the submitted sync can be mixed just as the yaml would be. But I am up for adjusting it to whatever we end up with.

@lmccay
Copy link
Contributor

lmccay commented Jan 12, 2024

@ashvina , @the-other-tim-brown - can we step back and revisit problem statement for the PerTableConfig refactoring here?
I think that #293 related topics regarding format specific details leaking into various interfaces can be solved by injection like in #306 but we need to maybe still discuss the #296 motivation as well.

If we can limit the exposure of PTC to the Factory then it can more closely resemble the sync details and not worry about the table format specifics being in there.

I do think that certain implementation classes themselves being used inside PTC itself for things like adding default implementations, etc should maybe be moved into the TargetClients themselves. The PTC itself should maybe just contain config values rather than object references as well. The consumers could then react to null or missing values and create the defaults themselves for instance.

What is anything about #296 do we still need to consider in order for the metadata to be able to be located else some other location? Is this just additional format specific config values for location?

@lmccay
Copy link
Contributor

lmccay commented Jan 15, 2024

@ashvina , @the-other-tim-brown - it does occur to me that there is still a lack of extensibility in the config (even with the injection) due to the fact that the configuration details are built into the config class interface. If we were to add the ability to provide additional name/value pairs then you could add an arbitrary targetClient and specific config to PerTableConfig through that generic mechanism. Maybe with target specific prefixes to the params.

Injection would also be able to handle those params.

@ashvina
Copy link
Contributor Author

ashvina commented Jan 16, 2024

@lmccay. I don't think we should mix injection and extensible of PTC in one issue. PTC should be extensible regardless. Let's focus only on PTC here.
Regarding #296, I suggest we create an entity for external tables. We should treat both source and target tables as complete entities, so the core logic doesn't have to infer or assume anything. The main client, such as RunSync, should handle those usability shortcuts.

@the-other-tim-brown
Copy link
Contributor

I am off from work this week so I will be able to put up some suggestions in the form of PRs for breaking down the existing PerTableConfig and making it extensible for users interacting directly calling the sync method.

One option is to move towards a model like Hudi uses where you have a Properties object that is essentially a key-value map with defaults. Then each implementation can add their own properties and reuse existing ones. This provides the most flexibility but may not be as clear to developers which properties are required or not.

@lmccay
Copy link
Contributor

lmccay commented Jan 17, 2024

@the-other-tim-brown , @ashvina - I am very used to the hadoop style config and therefore like that idea.
One thing we do there is provide a pattern for access to all params that start with a given prefix. That doesn't help with determining the required params up front but it does help with name collision type things and filtering to only the configs you need.
You could always add that sort of method in a Properties extension or the like.

@lmccay
Copy link
Contributor

lmccay commented Jan 17, 2024

@lmccay. I don't think we should mix injection and extensible of PTC in one issue. PTC should be extensible regardless. Let's focus only on PTC here. Regarding #296, I suggest we create an entity for external tables. We should treat both source and target tables as complete entities, so the core logic doesn't have to infer or assume anything. The main client, such as RunSync, should handle those usability shortcuts.

@ashvina - I didn't mean to suggest that injection should be part of this, sorry. I was just pointing out that solving the leaking of table specifics with something like injection doesn't address the extensibility of PTC itself.

Can you explain what external tables are, if they are not source and target tables? Probably bumping up against my ignorance here.

@the-other-tim-brown
Copy link
Contributor

I've created a draft that adopts the model @ashvina proposed above and gets rid of format specifics from the table configs #312

@the-other-tim-brown
Copy link
Contributor

@lmccay. I don't think we should mix injection and extensible of PTC in one issue. PTC should be extensible regardless. Let's focus only on PTC here. Regarding #296, I suggest we create an entity for external tables. We should treat both source and target tables as complete entities, so the core logic doesn't have to infer or assume anything. The main client, such as RunSync, should handle those usability shortcuts.

@ashvina - I didn't mean to suggest that injection should be part of this, sorry. I was just pointing out that solving the leaking of table specifics with something like injection doesn't address the extensibility of PTC itself.

Can you explain what external tables are, if they are not source and target tables? Probably bumping up against my ignorance here.

@lmccay the "external" was used in the original description to distinguish between the internal data model for a table. The diagram at the top of the issue shows what it is meant to represent. External may not be the best term to use since data warehouses have external tables as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants