Duplicate lines in atomic database #384

Open
1 of 2 tasks
AlexHls opened this issue Oct 12, 2023 · 1 comment

@AlexHls
Member

AlexHls commented Oct 12, 2023

Describe the bug

The atomic database generated from the quickstart notebook (i.e. the TARDIS quickstart database kurucz_cd23_chianti_H_He) contains duplicate entries in lines_data/lines. This can lead to issues, e.g., when reindexing the database (see TARDIS #2442).

To Reproduce

import pandas as pd
from tardis.io.atom_data.util import download_atom_data

download_atom_data('kurucz_cd23_chianti_H_He')
# Open read-only so the check cannot modify the file
store = pd.HDFStore('/path/to/kurucz_cd23_chianti_H_He.h5', mode='r')

def check_duplicates(df, verbose=False):
    """Raise ValueError if rows sharing an index differ in anything but line_id."""
    # unique() avoids visiting an index twice when it occurs three or more times
    dup_idx = df.index[df.index.duplicated()].unique()
    not_identical = 0
    for idx in dup_idx:
        data = df.loc[idx]
        assert len(data) > 1, "Expected more than one row for a duplicated index"
        # Compare each row with the next one that shares the same index
        for i in range(len(data) - 1):
            # Line ID will obviously be different
            data_a = data.iloc[i].drop(labels="line_id")
            data_b = data.iloc[i+1].drop(labels="line_id")
            if not data_a.equals(data_b):
                if verbose:
                    print(data)
                not_identical += 1
    if not_identical > 0:
        raise ValueError("Not all data is identical! (%d not identical)" % not_identical)

check_duplicates(store["lines"])
store.close()

Screenshots

System

  • OS:

    • GNU/Linux
    • macOS
  • Environment (conda list):
    (Default carsus env)

Additional context

@andrewfullard
Contributor

I'm not sure why this code raises a ValueError when the data are not identical; I assume it should be looking at the count of duplicates instead. I tried this with newly generated output from Carsus and there don't seem to be any duplicates (though it's hard to say), so regenerating the basic TARDIS atom data seems like a good idea.
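
For reference, a minimal sketch of counting duplicated index entries directly, rather than comparing row contents. It runs on a toy DataFrame, not the real database; the function name count_duplicate_index_entries, the index level names, and the toy values are all hypothetical and only stand in for the "lines" table.

```python
import pandas as pd


def count_duplicate_index_entries(df):
    """Return (rows whose index occurs more than once, distinct duplicated index values)."""
    # keep=False marks every occurrence of a repeated index, not just the later ones
    dup_mask = df.index.duplicated(keep=False)
    n_rows = int(dup_mask.sum())
    n_keys = len(df.index[dup_mask].unique())
    return n_rows, n_keys


# Toy stand-in for the "lines" table (assumed layout, illustrative values only)
toy = pd.DataFrame(
    {"line_id": [0, 1, 2, 3], "wavelength": [1215.67, 1215.67, 1025.72, 972.54]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0, 1), (1, 0, 0, 1), (1, 0, 0, 2), (1, 0, 0, 3)],
        names=["atomic_number", "ion_number", "level_number_lower", "level_number_upper"],
    ),
)

n_rows, n_keys = count_duplicate_index_entries(toy)
print(f"{n_rows} rows share {n_keys} duplicated index value(s)")

# Dropping the repeats while keeping the first occurrence of each index
deduplicated = toy[~toy.index.duplicated(keep="first")]
```

On the real file, the same function could be called on store["lines"] to see how many entries are affected before deciding whether to regenerate the data.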
