
Adds multi format reader #250

Merged: 73 commits merged into master from multi-format-reader on Aug 12, 2024

Conversation

@domna (Collaborator) commented Feb 21, 2024

Fixes #213

ToDo:

  • Add support for config file parsing (helper in utils + default parsing)
  • Add support for nomad eln file parsing (helper in utils + default parsing)

Will not be implemented here

Considerations

  • Probably this should replace the YamlJsonReader entirely
  • For that, the hall and transmission readers need to be changed accordingly

@domna domna self-assigned this Feb 21, 2024
@domna (Collaborator, Author) left a comment:

Hey @lukaspie, I started to work on a generic format reader. We should align here and come up with a common json config file notation; then we can do the parsing in this reader directly instead of in our plugins.

@coveralls (Contributor) commented:

Pull Request Test Coverage Report for Build 7988023443

Details

  • 13 of 30 (43.33%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.04%) to 44.526%

| File with missing coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pynxtools/dataconverter/readers/multi/reader.py | 13 | 30 | 43.33% |

Totals (change from base Build 7976686706: -0.04%): Covered Lines 4913, Relevant Lines 11034

💛 - Coveralls

@sherjeelshabih (Collaborator) commented:

This sounds close to what the JsonMapReader does. It already supports multiple formats. Maybe we can bring that in as well.

@domna commented Feb 21, 2024:

> This sounds close to what the JsonMapReader does. It already supports multiple formats. Maybe we can bring that in as well.

Indeed, it is very similar. The main difference I see is that the json map reader uses a mapping file where we want to use a combination file. It is a json which either contains literal values or loads data from another source via @data:... or similar.

However, the way I want to design this is that you can always change how certain formats are read, i.e., that you specify a dict of file-endings + read functions (actually, this is how the YamlJsonReader already works). That way the json map reader could also be implemented in this fashion as an implementation for reading .mapping.json + .pickle files. But we could also directly integrate mapping and pickling support (even though I would be happy to not rely on pickling at all).

@sherjeelshabih (Collaborator) commented:

> > This sounds close to what the JsonMapReader does. It already supports multiple formats. Maybe we can bring that in as well.
>
> Indeed, it is very similar. The main difference I see is that the json map reader uses a mapping file where we want to use a combination file. It is a json which either contains literal values or loads data from another source via @data:... or similar.
>
> However, the way I want to design this is that you can always change how certain formats are read, i.e., that you specify a dict of file-endings + read functions (actually, this is how the YamlJsonReader already works). That way the json map reader could also be implemented in this fashion as an implementation for reading .mapping.json + .pickle files. But we could also directly integrate mapping and pickling support (even though I would be happy to not rely on pickling at all).

I think we should get rid of pickling eventually. I'll need to change the NXiv_temp example.

If we will have a generic reader that can be customized like this, then we should streamline it and direct users to one unified place for such functions. I'll have a thorough look later into how the YamlJsonReader works. I don't mind making big changes to bring the two together. The less clutter we have the less we need to work on and users need to figure out.

@domna commented Feb 21, 2024:

> > > This sounds close to what the JsonMapReader does. It already supports multiple formats. Maybe we can bring that in as well.
>
> > Indeed, it is very similar. The main difference I see is that the json map reader uses a mapping file where we want to use a combination file. It is a json which either contains literal values or loads data from another source via @data:... or similar.
> > However, the way I want to design this is that you can always change how certain formats are read, i.e., that you specify a dict of file-endings + read functions (actually, this is how the YamlJsonReader already works). That way the json map reader could also be implemented in this fashion as an implementation for reading .mapping.json + .pickle files. But we could also directly integrate mapping and pickling support (even though I would be happy to not rely on pickling at all).
>
> I think we should get rid of pickling eventually. I'll need to change the NXiv_temp example.
>
> If we will have a generic reader that can be customized like this, then we should streamline it and direct users to one unified place for such functions. I'll have a thorough look later into how the YamlJsonReader works. I don't mind making big changes to bring the two together. The less clutter we have the less we need to work on and users need to figure out.

Yes, this is also my main endeavour with this PR. You can check the hall or transmission reader to see an implementation using this (I think it's easier to grasp from there). Here is the actual reader for hall:

```python
class HallReader(YamlJsonReader):
    """HallReader implementation for the DataConverter
    to convert Hall data to Nexus."""

    supported_nxdls: List[str] = ["NXroot"]
    extensions = {
        ".txt": lambda fname: parse_txt(fname, "iso-8859-1"),
        ".json": parse_json,
        ".yml": lambda fname: parse_yml(fname, None, None),
        ".yaml": lambda fname: parse_yml(fname, None, None),
    }
```

I think it's easy and straightforward to understand and all of the parsing is offloaded into the parse_... functions.

The current plan with this MultiFormatReader is the following:

  • We allow using an extensions dict in an inheriting reader, which maps file endings to parse functions.
  • The reader should pick up common functionality for standard file endings (json config files, nomad eln yamls), which can be overridden with the extensions dict. We already have a parse_json helper in pynxtools.dataconverter.readers.utils that readers can re-use alongside other files; we should also add one for yaml and set both as defaults in MultiFormatReader.
  • We should advocate this generic reader for most cases, and the base reader if you want to do something special.
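The plan above could be sketched roughly as follows. This is a hedged illustration, not the actual pynxtools implementation: MyReader, the parse_json stand-in, and the .txt handler are all hypothetical, and only the extensions-dict dispatch mirrors the described design.

```python
import json
import os
from typing import Any, Callable, Dict, List


def parse_json(fname: str) -> Dict[str, Any]:
    """Stand-in for the parse_json helper in pynxtools.dataconverter.readers.utils."""
    with open(fname, encoding="utf-8") as f:
        return json.load(f)


class MultiFormatReader:
    """Minimal sketch of the proposed base reader with default handlers."""

    # Defaults for standard file endings; subclasses may override entries.
    extensions: Dict[str, Callable[[str], Dict[str, Any]]] = {
        ".json": parse_json,
    }

    def read(self, file_paths: List[str]) -> Dict[str, Any]:
        template: Dict[str, Any] = {}
        for path in file_paths:
            ext = os.path.splitext(path)[1]
            # Unknown extensions contribute nothing (mirrors the lambda _: {} fallback).
            template.update(self.extensions.get(ext, lambda _: {})(path))
        return template


class MyReader(MultiFormatReader):
    """Hypothetical inheriting reader: overrides only the formats it cares about."""

    extensions = {
        **MultiFormatReader.extensions,
        ".txt": lambda fname: {"/ENTRY[entry]/raw": open(fname).read()},
    }
```

The subclass keeps the base defaults by splatting MultiFormatReader.extensions and only adds its own .txt handler.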

@coveralls commented Feb 22, 2024:

Pull Request Test Coverage Report for Build 8689694136

Details

  • 66 of 155 (42.58%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.6%) to 76.294%

| File with missing coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pynxtools/dataconverter/template.py | 18 | 21 | 85.71% |
| pynxtools/dataconverter/readers/utils.py | 4 | 9 | 44.44% |
| pynxtools/dataconverter/readers/multi/reader.py | 43 | 124 | 34.68% |

Totals (change from base Build 8647780955: -1.6%): Covered Lines 2668, Relevant Lines 3497

💛 - Coveralls

@lukaspie (Collaborator) commented:
FWIW, I have taken quite a different approach in the XPS reader, with different classes for the individual sub-readers for different formats and the selection based on the file format happening in a different script. So if we implement this, I will have to change the XPS reader quite a bit. I would be happy to do so if we get more aligned across the different readers and if the config/eln file parsing is unified.

@domna commented Feb 22, 2024:

> FWIW, I have taken quite a different approach in the XPS reader, with different classes for the individual sub-readers for different formats and the selection based on the file format happening in a different script. So if we implement this, I will have to change the XPS reader quite a bit. I would be happy to do so if we get more aligned across the different readers and if the config/eln file parsing is unified.

I think we might be able to support both. As far as I understood it from a first glance you keep a dict with a reader class. So we could use this in the multi format reader to instantiate the class and invoke its read function whenever we find a class instead of a function. I think this is also a nice feature, that you can either specify a simple read function or a complete reader.

But I would anyways keep the config parsing as a separate function. So you could also just reuse this in your code.

@lukaspie (Collaborator) commented:

> > FWIW, I have taken quite a different approach in the XPS reader, with different classes for the individual sub-readers for different formats and the selection based on the file format happening in a different script. So if we implement this, I will have to change the XPS reader quite a bit. I would be happy to do so if we get more aligned across the different readers and if the config/eln file parsing is unified.
>
> I think we might be able to support both. As far as I understood it from a first glance you keep a dict with a reader class. So we could use this in the multi format reader to instantiate the class and invoke its read function whenever we find a class instead of a function. I think this is also a nice feature, that you can either specify a simple read function or a complete reader.
>
> But I would anyways keep the config parsing as a separate function. So you could also just reuse this in your code.

Yeah, it's not actually a "reader" that's called in the sense that it inherits from BaseReader 😄 I have a three-layer structure: there is the XpsReader(BaseReader), then I have a layer that I call "mappers" (see here for an example), and these mappers themselves actually call "parsers" (example). The mappers are used for one file format (like the sle format from SPECS), and then I have logic that calls a parser for a specific subset of such files, e.g. depending on the software version that was used for this file. That allows me to keep functionality across different, yet very similar, versions of a format via inheritance and abstract base classes.

Now that I think about it, maybe all of those sub-classes could be readers themselves, inheriting from our BaseReader class (or the MultiFormatReader). In any case, having the option to instantiate a "reading"/"mapping"/"parsing" class and invoke its top-level (aka read) function through the extension dict would be nice. We just have to make sure that these functions are always called the same.

I additionally have the problem that the extension is often not unique, i.e., many vendors have a .txt export, but all the files are actually different. This logic could probably be handled by passing a function in the extensions dict that does the disambiguation; I am actually doing that already. And finally, I also have a check that the file comes from the list of supported vendors.
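The "simple read function or a complete reader" dispatch discussed above could work along these lines. This is a sketch under stated assumptions: SleMapper, parse_txt, and the EXTENSIONS table are hypothetical stand-ins, and the only point illustrated is branching on whether the registered handler is a class or a callable.

```python
import inspect
from typing import Any, Callable, Dict, Union


class SleMapper:
    """Hypothetical stand-in for a mapper/parser class with a read entry point."""

    def read(self, file_path: str) -> Dict[str, Any]:
        return {"/ENTRY[entry]/format": "sle", "/ENTRY[entry]/file": file_path}


def parse_txt(file_path: str) -> Dict[str, Any]:
    """Hypothetical plain parse function for .txt files."""
    return {"/ENTRY[entry]/format": "txt"}


# Extensions may map either to a plain function or to a reader-like class.
EXTENSIONS: Dict[str, Union[Callable[[str], Dict[str, Any]], type]] = {
    ".sle": SleMapper,
    ".txt": parse_txt,
}


def dispatch(file_path: str) -> Dict[str, Any]:
    ext = file_path[file_path.rfind("."):]
    handler = EXTENSIONS[ext]
    if inspect.isclass(handler):
        # Found a class instead of a function: instantiate it and invoke read.
        return handler().read(file_path)
    return handler(file_path)
```

This keeps the extensions dict as the single registration point while allowing full reader classes, matching the "make sure these functions are always called the same" caveat: every class is expected to expose read().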

@sherjeelshabih (Collaborator) commented:

There is also an idea to bring a unified json config and mapping to the dataconverter itself. This is still under discussion but I'm leaving this here as a reminder.

@domna commented Feb 23, 2024:

@sherjeelshabih and @lukaspie, I finished the first draft of the MultiFormatReader now. I did not test it yet but I think we can use this as a basis to discuss about the structure.

The general mechanism is as detailed above, that we use the extensions dict to figure out which extension is handled by which function and the reader just glues everything together. I guess we could also introduce a testing function in this dict to distinguish different formats if you find this helpful (e.g., when .txt files with different syntax are used).

There are also two special functions now: setup_template, which fills in keys before anything else is done (for fixed info like the reader name and version), and handle_objects, which handles the objects passed to the reader's read function.

What this reader supports is a config dict with some special syntax which is noted as, e.g., @attrs:... in the values of the dict. The config dict itself is always reduced to a flat dict, hence users can group it any way they want. We support the following special keywords:

  • @attrs: Load an attribute from the input data, i.e., this is coming from somewhere else in the complete data passed to the reader. We use this in the mpes reader to load data from metadata in an hdf5 file. It takes a path, i.e., something like @attrs:/my_metadata/entry. The MultiFormatReader has a function get_attr, which takes a path and returns the data for that field. The sub-reader would then implement this function to provide the actual data to write.
  • @data: Works similarly to @attrs, but for data entries. This is split in the mpes reader because we retrieve data from an xarray. It is resolved by the get_data function in the reader. Maybe this can also be aligned with the @attrs keyword.
  • @link: This constructs a link to the provided path. It is the replacement for the {'link': '/path'} syntax, which we cannot use anymore because of the flattening.
  • There is also some special syntax with wildcard (*) replacements; this is what the get_data_dims function is for. It returns a list of all available data axes to replace the wildcard. We use this to fill axis names automatically from xarray dims in the mpes reader (see here for a usage example).
  • We also support a named wildcard to repeat entries under different names (e.g., *{my, name, etc} expands to three keys with my, name, and etc substituted, respectively). Here you can see it in action for the lens voltage readout in mpes (same mapping but different lens names).

Here is the mpes example config file as an implementation of the config file syntax above.
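A minimal resolver for the @attrs/@data/@link prefixes described above might look like this sketch. The ATTRS/DATA dicts are mocked stand-ins for the per-format get_attr/get_data implementations a sub-reader would provide; none of this is the actual pynxtools code.

```python
from typing import Any, Dict

# Mocked data sources; a real sub-reader implements get_attr/get_data per format.
ATTRS: Dict[str, Any] = {"/my_metadata/entry": "sample_1"}
DATA: Dict[str, Any] = {"/axes/energy": [1.0, 2.0, 3.0]}


def get_attr(path: str) -> Any:
    """Return metadata for the given path (e.g. from hdf5 attributes)."""
    return ATTRS[path]


def get_data(path: str) -> Any:
    """Return data for the given path (e.g. from an xarray)."""
    return DATA[path]


def resolve(value: Any) -> Any:
    """Resolve the special @attrs:/@data:/@link: prefixes in a flat config value."""
    if not isinstance(value, str):
        return value
    if value.startswith("@attrs:"):
        return get_attr(value[len("@attrs:"):])
    if value.startswith("@data:"):
        return get_data(value[len("@data:"):])
    if value.startswith("@link:"):
        # Replacement for the old {'link': '/path'} syntax lost to flattening.
        return {"link": value[len("@link:"):]}
    return value  # literal value, written as-is
```

Values without a recognized prefix pass through unchanged, which is what lets the flat config dict mix literals and references freely.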

I also had a look at the json map reader, and from my view it basically does an @attrs on each value compared to the notation above. So we could just do an "auto-@attrs" for the mapping. The question is whether it would also be feasible to use @link in the mapping, and I think we would need to come up with a replacement for the slicing syntax you use, since it is embedded in the shape key of a nested dict (which the config reader would flatten away).

  • @lukaspie Does this roughly align with the config file you use for xps?
  • Do you think we should move this as a standard functionality in the BaseReader? People could just override the read function if they don't want the dict behaviour and ignore everything else or they can extend the list with their special parsing for a particular file format.

```python
print(f"WARNING: File {file_path} does not exist, ignoring entry.")
continue

template.update(self.extensions.get(extension, lambda _: {})(file_path))
```
A collaborator commented on this line:

Alternative: I have a function that only updates a key if it does not already exist in the template.

@domna (Collaborator, Author) replied:

Let's discuss this. In my view it might be useful sometimes to allow overwriting, e.g., with the json dict. But I could also imagine giving the reader a flag to switch this behaviour and set one of the options as default.
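The flag-switched behaviour suggested here could be as simple as the following sketch; update_template and its overwrite parameter are illustrative names, not existing API.

```python
from typing import Any, Dict


def update_template(
    template: Dict[str, Any],
    new_data: Dict[str, Any],
    overwrite: bool = True,
) -> None:
    """Merge new_data into template; with overwrite=False, existing keys win."""
    for key, value in new_data.items():
        if not overwrite and key in template:
            # Preserve whatever an earlier file already wrote.
            continue
        template[key] = value
```

The default (overwrite=True) reproduces plain dict.update, so the flag only changes behaviour for readers that opt out of overwriting.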

@lukaspie commented Mar 4, 2024:

What I don't understand: how are multiple entries handled?

In the XPS reader, I need to always change the template key from "/ENTRY[entry]/" (which is in the config) to "/ENTRY[entry_1]/", "/ENTRY[entry_2]/", and so on. The same is actually true for links, where I put "@link:/entry/instrument/beam_probe" in the config and then the actual links are linking within the same entry (i.e., to /entry_1/instrument/beam_probe).

Basically I need to iterate the config file and replace the key for each entry. I can do that inside the different functions, e.g. in get_data, but maybe there could be a more sophisticated version right at a higher level.

@domna commented Mar 4, 2024:

> What I don't understand: how are multiple entries handled?
>
> In the XPS reader, I need to always change the template key from "/ENTRY[entry]/" (which is in the config) to "/ENTRY[entry_1]/", "/ENTRY[entry_2]/", and so on. The same is actually true for links, where I put "@link:/entry/instrument/beam_probe" in the config and then the actual links are linking within the same entry (i.e., to /entry_1/instrument/beam_probe).
>
> Basically I need to iterate the config file and replace the key for each entry. I can do that inside the different functions, e.g. in get_data, but maybe there could be a more sophisticated version right at a higher level.

Yes, this is a good point. I also thought about this already but have not added it yet, because we simply did not have this functionality in the mpes reader/config.
We could also have some kind of wildcard syntax, or just let the reader replace /ENTRY[entry]/ with the appropriate entry name (that way your json dict is a template for a single entry, but your reader creates multiple entries based on this template).
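The "single-entry config as a template" idea could be sketched like this. The expand_entries helper and its handling of @link targets are illustrative assumptions, not existing pynxtools behaviour.

```python
from typing import Any, Dict, List


def expand_entries(config: Dict[str, Any], entry_names: List[str]) -> Dict[str, Any]:
    """Stamp out one copy of a single-entry config per requested entry name.

    Replaces /ENTRY[entry]/ in keys, and rewrites @link: targets that point
    into /entry/ so links stay within the same renamed entry.
    """
    expanded: Dict[str, Any] = {}
    for name in entry_names:
        for key, value in config.items():
            new_key = key.replace("/ENTRY[entry]/", f"/ENTRY[{name}]/")
            if isinstance(value, str) and value.startswith("@link:/entry/"):
                value = value.replace("@link:/entry/", f"@link:/{name}/")
            expanded[new_key] = value
    return expanded
```

With a single name the config passes through essentially unchanged, which matches the "one entry called entry" default discussed later in the thread.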

@lukaspie commented Mar 4, 2024:

> We could also have some kind of wildcard syntax or just let the reader replace /ENTRY[entry]/ with the appropriate entry name (that way your json dict is a template for a single entry but your reader will create multiple entries based on this template).

This is how I currently have it.

@lukaspie commented Mar 4, 2024:

Eventually, bringing in this functionality that Rubel wants to implement anyway (see #255) would be very good as well.

@lukaspie commented Mar 5, 2024:

As a summary to our meeting yesterday, here's what I think would be nice to implement:

  • Distinguish config parsing from json data parsing. Add a --config-file option to the dataconverter.
  • Default config files. Either let the reader (or a sub-reader) set the config file and parse it directly from the reader (like in pynxtools-xps) or allow a user to store their default experiment configuration somewhere.
  • Make the ELN yaml file another data file that gets activated with the token @eln.
  • Allow lists as config value to provide different options depending if the data exists in the specified position. As an example, for [@attrs, @data, @eln, "0.0"], the converter would check (in order) the @attrs, @data, and @eln tokens and write the respective value if it exists there. If not, it defaults to "0.0".
  • Entry renaming if there are multiple. Another thought I had: maybe this could also be done through a wildcard. That is, if I have "/ENTRY[entry*]/" in my config list and provide a list of entry names, there will be an entry for each of these. And if I just use "/ENTRY[entry]/", there is only one entry called entry. This could also work e.g. for multiple detectors ("/ENTRY[entry*]/DETECTOR[detector*]/").
  • Link renaming if there are multiple. On the value side, the same syntax should work for links, i.e., replace @link: entry*/my_field in the config by {@link: entry_1/my_field} in the template.
  • Should files that come later in the sorting be allowed to overwrite existing data in the template? Could also be implemented as a flag. Probably related to the "Allow lists as config value" point above.
  • Optional: automatically write the default attribute to immediate child groups (Adding default attribute to immediate child group #255) or at least have a function that can do this (like this).

Feel free to extend if you have anything else.
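The fallback-list bullet above could behave roughly like this sketch. The SOURCES dicts mock the reader's @attrs/@data/@eln lookups, and resolve_with_fallback is a hypothetical helper; only the ordered-fallback logic follows the proposal.

```python
from typing import Any, Dict, List, Optional

# Mocked data sources standing in for the reader's @attrs/@data/@eln lookups.
SOURCES: Dict[str, Dict[str, Any]] = {
    "@attrs": {},
    "@data": {"/entry/temperature": 300.0},
    "@eln": {"/entry/temperature": 295.0},
}


def resolve_with_fallback(options: List[Any], path: str) -> Optional[Any]:
    """Try each token in order; a non-token entry acts as the literal default."""
    for option in options:
        source = SOURCES.get(option)
        if source is None:
            # Not a known token, so treat it as the literal default, e.g. "0.0".
            return option
        if path in source:
            return source[path]
    return None  # nothing found and no default given
```

Because @data comes before @eln in the example list, measured data wins over the ELN value when both exist at the same path.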

@domna domna merged commit 203c2f6 into master Aug 12, 2024
12 checks passed
@domna domna deleted the multi-format-reader branch August 12, 2024 13:38
Development

Successfully merging this pull request may close these issues.

Streamline reading of json and yaml files