[BUG] Scientific mathematical notations are being read as text #488

Closed
kreuz1995 opened this issue Dec 18, 2023 · 3 comments · Fixed by #490
Labels
bug Something isn't working softwarecampus Issues related to the Softwarecampus grant

Comments

@kreuz1995

kreuz1995 commented Dec 18, 2023

Steps to reproduce

  1. Values in scientific notation such as 9E-2, which should mean 0.09, are being read as text by Jayvee (when retrieved from a CSV file). Hence, when we declare such a column as decimal, many of the values are filtered out.

Steps:

pipeline ThermoelectricMaterialsPipeline {

	block ThermoelectricMaterialsExtractor oftype HttpExtractor {
		url: "https://figshare.com/ndownloader/files/36554916";
	}

	block ZipArchiveInterpreter oftype ArchiveInterpreter {
		archiveType: "zip";
	}
	block MainThermoelectricMaterialsCSVPicker oftype FilePicker {
		path: "/TE_databases/CSV/main_tedb.csv";
	}
	block MainThermoelectricMaterialsTextFileInterpreter oftype TextFileInterpreter {
	}
	block MainThermoelectricMaterialsCSVInterpreter oftype CSVInterpreter {
		delimiter: ",";
	}
	block MainThermoelectricMaterialsTableInterpreter oftype TableInterpreter {
		header: true;
		columns: [
			"Name" oftype text,
			"Label" oftype text,
			"Editing" oftype text,
			"Model" oftype text,
			"Model_Type" oftype text,
			"Specifier" oftype text,
			"Value" oftype decimal,
			"Units" oftype text,
			"Temperature_Value" oftype text,
			"Temperature_Units" oftype text,
			"Value_Average" oftype text,
			"Temperature_Average" oftype text,
			"Pressure" oftype text,
			"Process" oftype text,
			"Direction_of_Measurement" oftype text,
			"Resolution" oftype text,
			"DOI" oftype text,
			"Title" oftype text,
			"Access_Type" oftype text,
			"Publisher" oftype text,
			"Publication_Year" oftype text,
			"Authors" oftype text,
			"Journal" oftype text,
		];
	}
	block MainThermoelectricMaterialsDatabaseLoader oftype SQLiteLoader {
		table: "MainThermoelectricMaterials";
		file: "./ThermoelectricMaterials.sqlite";
	}
	ThermoelectricMaterialsExtractor
		-> ZipArchiveInterpreter
		-> MainThermoelectricMaterialsCSVPicker
		-> MainThermoelectricMaterialsTextFileInterpreter
		-> MainThermoelectricMaterialsCSVInterpreter
		-> MainThermoelectricMaterialsTableInterpreter
		-> MainThermoelectricMaterialsDatabaseLoader;
}

Description

  • Expected: 9e-2 should be read as 0.09 (decimal datatype)
  • Actual: 9e-2 is being read as a text datatype instead.
@kreuz1995 kreuz1995 added the bug Something isn't working label Dec 18, 2023
@kreuz1995 kreuz1995 changed the title [BUG] <name> [BUG] <Scientific mathematical notations are being read as text> Dec 18, 2023
@rhazn rhazn changed the title [BUG] <Scientific mathematical notations are being read as text> [BUG] Scientific mathematical notations are being read as text Dec 18, 2023
@rhazn rhazn added the softwarecampus Issues related to the Softwarecampus grant label Dec 18, 2023
@rhazn
Contributor

rhazn commented Dec 18, 2023

Just to be clear on wording, @kreuz1995: Jayvee does not read the values as a text or decimal datatype. You define the datatype of a column, and if you define it as decimal, these values (which should be valid) are considered invalid by Jayvee. That is a subtle difference (but the bug exists).

For anyone solving this: the likely cause is the scientific notation failing the regex check in https://github.com/jvalue/jayvee/blob/main/libs/execution/src/lib/types/valuetypes/internal-representation-parsing.ts#L51. The actual parsing works fine.
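To illustrate the point about the regex check, here is a minimal TypeScript sketch. The two patterns and the `parseDecimal` helper are hypothetical stand-ins, not copied from the Jayvee source; they only show how a digits-only pattern can reject `9E-2` even though `Number.parseFloat` handles it, and how an optional exponent group might change that.

```typescript
// Hypothetical strict pattern: optional sign, optional integer part and dot, digits.
// This is an assumption for illustration, NOT the regex used in Jayvee.
const STRICT_DECIMAL = /^[+-]?([0-9]*[.])?[0-9]+$/;

// Same pattern with an optional exponent part such as "e-2" or "E+10".
const DECIMAL_WITH_EXPONENT = /^[+-]?([0-9]*[.])?[0-9]+([eE][+-]?[0-9]+)?$/;

// Validate first, then parse; invalid values are reported as undefined.
function parseDecimal(value: string, pattern: RegExp): number | undefined {
  if (!pattern.test(value)) {
    return undefined;
  }
  return Number.parseFloat(value);
}

const rejected = parseDecimal("9E-2", STRICT_DECIMAL); // fails the check
const accepted = parseDecimal("9E-2", DECIMAL_WITH_EXPONENT); // passes the check
console.log(rejected, accepted);
```

In this sketch the parsing step is unchanged between the two calls; only the validation pattern differs.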

@georg-schwarz
Member

I'll play devil's advocate: Do we really want to support that notation by default?
Another approach could be introducing a transform block in the std-lib that converts these scientific numbers into regular ones.

Not sure which design is better. How much of the logic should we move into the language vs. how much should be modeled by the user explicitly?

In this specific case, I wouldn't mind either way. But we should establish some rule-of-thumb if we can.

@rhazn
Contributor

rhazn commented Dec 18, 2023

Interesting, I thought this would be a no-brainer, but I can see how it might not be. However, in general I think we should be as permissive with parsing as possible, i.e., if a value can be parsed unambiguously, we should parse it. I think we should parse 10.2 (can only be one value), 12e-2 (same), etc., but not 12ef7 (could be 12, 7, 12.7, 12e7...), for example.
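A rough TypeScript sketch of that rule of thumb, under the assumption that "unambiguous" means the whole string matches a decimal grammar. `DECIMAL_PATTERN` and `parseUnambiguous` are hypothetical names, not part of Jayvee. The last line shows why prefix-based parsing alone would not be enough: `Number.parseFloat` stops at the first invalid character and silently picks one interpretation.

```typescript
// Hypothetical whole-string grammar: sign, digits with optional fraction, optional exponent.
const DECIMAL_PATTERN = /^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$/;

// Returns a number only when the entire input has exactly one reading.
function parseUnambiguous(raw: string): number | undefined {
  const trimmed = raw.trim();
  return DECIMAL_PATTERN.test(trimmed) ? Number(trimmed) : undefined;
}

console.log(parseUnambiguous("10.2")); // plain decimal
console.log(parseUnambiguous("12e-2")); // scientific notation
console.log(parseUnambiguous("12ef7")); // rejected, no single reading

// For comparison: prefix parsing accepts the leading "12" and ignores "ef7".
console.log(Number.parseFloat("12ef7"));
```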
