Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent splits naming between Raw and Epochs when using BIDS schema #11870

Closed
dmalt opened this issue Aug 8, 2023 · 11 comments · Fixed by #11876 or #11884
Closed

Inconsistent splits naming between Raw and Epochs when using BIDS schema #11870

dmalt opened this issue Aug 8, 2023 · 11 comments · Fixed by #11876 or #11884
Labels

Comments

@dmalt
Copy link
Contributor

dmalt commented Aug 8, 2023

Description of the problem

Not really a bug, more of an inconsistent behaviour.
Saving large Raw and Epochs data with split_naming='bids' uses different naming schema.

When I save large Raw, I get suffixes split-01 and split-02, etc. but when saving Epochs, the first split is saved without suffix, while the second one has split-01. See the code snippet below for details. All asserts pass for me.

Is there a reason for such difference?

Steps to reproduce

import mne
from pathlib import Path

sample_data_folder = mne.datasets.sample.data_path()
sample_data_raw_file = sample_data_folder / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = mne.io.read_raw_fif(sample_data_raw_file, verbose=False).crop(tmax=60)
ev = mne.find_events(raw)
epochs = mne.Epochs(raw, ev, tmin=-0.3, tmax=0.7)

raw.save("raw_meg.fif", split_naming="bids", split_size="50MB")
epochs.save("epochs_epo.fif", split_naming="bids", split_size="50MB")

# all these asserts pass
assert not Path("raw_meg.fif").is_file()
assert Path("raw_split-01_meg.fif").is_file()
assert Path("raw_split-02_meg.fif").is_file()

assert Path("epochs_epo.fif").is_file()  # this should fail though
assert Path("epochs_split-01_epo.fif").is_file()
assert not Path("epochs_split-02_epo.fif").is_file()  # this should fail as well

Link to data

No response

Expected results

These two asserts should fail

assert Path("epochs_epo.fif").is_file()
assert not Path("epochs_split-02_epo.fif").is_file()

Actual results

All asserts pass

Additional information

Platform macOS-13.4.1-arm64-arm-64bit
Python 3.11.3 (main, May 15 2023, 18:01:31) [Clang 14.0.6 ]
Executable /Users/dmalt/Applications/miniconda3/envs/mne_cli_tools/bin/python3
CPU arm (8 cores)
Memory Unavailable (requires "psutil" package)
Core
├☑ mne 1.4.2
├☑ numpy 1.25.2 (OpenBLAS 0.3.23.dev with 8 threads)
├☑ scipy 1.9.3
├☑ matplotlib 3.7.2Installed osx event loop hook.
(backend=MacOSX)
├☑ pooch 1.7.0
└☑ jinja2 3.1.2

Numerical (optional)
├☑ sklearn 1.3.0
├☑ pandas 1.5.3
└☐ unavailable numba, nibabel, nilearn, dipy, openmeeg, cupy

Visualization (optional)
└☐ unavailable pyvista, pyvistaqt, ipyvtklink, vtk, qtpy, ipympl, pyqtgraph, mne-qt-browser

Ecosystem (optional)
└☐ unavailable mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel, mne-bids-pipeline

@dmalt dmalt added the BUG label Aug 8, 2023
@larsoner
Copy link
Member

larsoner commented Aug 8, 2023

Seems like a bug to me. @sappelhoff WDYT?

@sappelhoff
Copy link
Member

Thanks for the report! I agree, this looks like a bug.

Furthermore, a minor point: the indexes (01, 02, ... ) do not really need to be prefixed with a 0 if they never exceed 9 -- and I guess we can know that before writing the files and adjust the behavior accordingly.

@dmalt
Copy link
Contributor Author

dmalt commented Aug 8, 2023

Thanks for such a quick response from both of you!

Frankly, now that I think of it, I like the behaviour of Epochs better than that of Raw, although I suspect it was a conscious design choice in case of Raw.

Picture this. You have to write a data processing pipeline for a new experiment for which only part of the data were recorded. So far all the recorded data are under 2GB, so you don't have to worry about splits. Then the new data arrive and for one subject the file size is above 2GB, so when you save the intermediate preprocessed copy, the saved file will have the split- suffix. Now all the other scripts that are trying to access this file are broken and should be changed. If the intermediate steps don't reduce the file size, like filtering, the result for those steps will also have the split- suffix after saving, and so on. The situation gets even worse if you're using some sort of text-based templating framework for the filenames (like hydra). This actually happened to me once and it was really annoying to fix.

One way to avoid this would be to use the old "neuromag" split naming, but in this case I think BIDS checkers start complaining.

Instead if the main split didn't change its name, like for Epochs now, everything would have worked the same way.
I'm not sure though, what BIDS specification has to say about this matter and what was the original thinking process behind the splits behaviour for Raw as it is now.

What do you guys think?

@sappelhoff
Copy link
Member

Instead if the main split didn't change its name, like for Epochs now, everything would have worked the same way.
I'm not sure though, what BIDS specification has to say about this matter and what was the original thinking process behind the splits behaviour for Raw as it is now.

The BIDS specification says when a file is split, all parts of that file need to contain the split suffix -- so the behavior as it is for raw currently is the BIDS conformant behavior.

It is RECOMMENDED that .fif files with multiple parts use the split- entity to indicate each part.

https://bids-specification.readthedocs.io/en/latest/glossary.html#split-entities

I understand where you are coming from, but I think you'd have to build in a little logic into your reading code to deal with this problem, and/or rely on BIDS reading software like mne-bids. Although admittedly, we do have open issues about split files there, see:

@dmalt
Copy link
Contributor Author

dmalt commented Aug 9, 2023

The BIDS specification says when a file is split, all parts of that file need to contain the split suffix -- so the behavior as it is for raw currently is the BIDS conformant behavior.

Ok, I see, Epochs.save needs to change then. I can try making a PR .

...and/or rely on BIDS reading software like mne-bids

I don't think mne-bids supports splits reading and writing with the same template at the moment. I would partly solve the problem. For example, this fails:

from pathlib import Path
from mne_bids import BIDSPath, read_raw_bids
from mne.io import read_raw
from mne.datasets import sample

root_path = Path("./tmp")
subj_path = root_path / "sub-01" / "meg"
subj_path.mkdir(parents=True)

raw_fpath = sample.data_path() / "MEG" / "sample" / "sample_audvis_raw.fif"
raw = read_raw(raw_fpath)
raw.crop(0, 10)

bp = BIDSPath(subject="01", root=root_path, suffix="meg", extension=".fif", datatype="meg")
raw.save(bp.fpath, split_size="10MB", split_naming="bids")
# produces tmp/sub-01/meg/sub-01_split-01_meg.fif, tmp/sub-01/meg/sub-01_split-02_meg.fif

loaded_raw = read_raw_bids(bp)  # fails with
# FileNotFoundError: File does not exist:
# tmp/sub-01/meg/sub-01_meg.fif
# Did you mean one of:
# sub-01_split-02_meg.fif
# sub-01_split-01_meg.fif
# instead of:
# sub-01_meg.fif

Something like this should work though, both with and without splits

loaded_raw = read_raw_bids(sorted(bp.match(), key=str)[0])

(sorted(..., key=str) is required, since we want to pick the first split, but BIDSPaths are not comparable)

mne-tools/mne-bids#1064

Splits are a source of never ending joy, it seems :)
Now that I remember, I actually had a splits-related issue with mne-bids two years ago:
mne-tools/mne-bids#712

@drammock
Copy link
Member

drammock commented Aug 9, 2023

ouch. it's really bad that raw.save(my_bidspath) generates filenames that can't be read by read_raw_bids(my_bidspath).

@dmalt
Copy link
Contributor Author

dmalt commented Aug 9, 2023

BTW, is it possible to modify saving raw files in .fif.gz so all the splits live inside the same archive to avoid having splits altogether?
I think "parse, don't validate" rule applies here: it might be better to convert to a format that doesn't use splits once than to think about this corner case each time. Just a hypothetical thought :)

EDIT.
I guess, gunzip doesn't allow merging the splits together. It would have to be a tarball, so something like .fif.tar.gz instead. But still worth a thought.

@agramfort
Copy link
Member

agramfort commented Aug 12, 2023 via email

@dmalt
Copy link
Contributor Author

dmalt commented Aug 12, 2023

it's not what .gz (gzip) does. It only compresses one file in mne (to make it smaller).

I see. Somehow I thought gzip was like zip archives.

I'm definitely not suggesting moving away from .fif. Merely having an option to wrap it into archive as a potential solution to splits. You already do the compression/decompression in io functions, when filename ends in .fif.gz. The same way you could wrap/unwrap split files (or a single file) in a tarball, when the extension is .fif.tar.gz or .fif.tar. It's not gonna solve the splits problem completely, because you can't really drop the support for the vanilla .fif, so the splits bugs will reappear from time to time. But if tarballs are advertised in the docs as a recommended way of dealing with splits, a lot of these bugs will go away for the end user. .fif splits are not meant to be treated as separate files: any time you move them apart or even rename them, everything breaks. It won't happen if they are inside an archive.

I realise there are issues with this approach but still thought it was worth suggesting. Don't know. Maybe it's all nonsense and will only complicate things further.

@agramfort
Copy link
Member

agramfort commented Aug 13, 2023 via email

@dmalt
Copy link
Contributor Author

dmalt commented Aug 16, 2023

Yes, that's true, this will break preload=False since there's no way to know where the second split starts in a tar archive without unarchiving the whole thing. Maybe with zip it's possible, since it has index. I'm not sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants