Labour Dataset S matrix is incorrect #15

hewsond · 2023-01-16T03:58:52Z

I've been using this dataset to validate some hierarchical forecasting techniques. When applying them to the Labour dataset, I noticed some odd results.

The summing matrix S shipped with the Labour dataset is incorrect. Two simple examples show this:

1. The second row, for the Australian Capital Territory, has 1's in the columns for NSW, VIC, etc.
2. The diagonal block of the matrix does not have its index and columns lined up correctly. E.g. the row ['Employed part-time', 'Males', 'Western Australia'] has its 1 in the column ['Employed part-time', 'Females', 'Australian Capital Territory'].

Ex1

Ex2
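Both symptoms can also be checked programmatically. The following is a minimal sketch, not code from the library: `check_bottom_block` is a hypothetical helper, and it assumes that every bottom-level label appears exactly once in the index of S and that S is a pandas DataFrame like the one returned by `HierarchicalData.load`. For each bottom-level series it verifies that the row with the same label contains a single 1, in its own column.

```python
import numpy as np
import pandas as pd


def check_bottom_block(S: pd.DataFrame) -> list:
    """Return a list of problems found in the bottom-level rows of S.

    Hypothetical helper: assumes index labels are unique and that each
    column label of S also appears as a row label.
    """
    problems = []
    for col in S.columns:
        if col not in S.index:
            problems.append(f"bottom series {col!r} has no row in S")
            continue
        row = S.loc[col].to_numpy(dtype=float)
        # A bottom-level row should be a unit vector pointing at its own column
        expected = (S.columns == col).astype(float)
        if not np.array_equal(row, expected):
            problems.append(f"row {col!r} does not map to its own column")
    return problems


# Toy hierarchy: one total and three bottom-level series
cols = ["A", "B", "C"]
good = pd.DataFrame(
    np.vstack([np.ones(3), np.eye(3)]), index=["Total"] + cols, columns=cols
)

# Reproduce the misalignment described above: row "B" points at column "C"
bad = good.copy()
bad.loc["B"] = [0.0, 0.0, 1.0]

print(check_bottom_block(good))
print(check_bottom_block(bad))
```

Running the same helper on the S returned for 'Labour' should list the misaligned rows, provided the assumptions above hold for that frame.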

Min code to replicate

from datasetsforecast.hierarchical import HierarchicalData
df, S, tags = HierarchicalData.load("datasetforecast_data_dir", 'Labour')
# Note that ACT is summed from NSW and Vic
S.iloc[:2,:5]

The correct S matrix is attached.
S_labour.csv

Code to generate above matrix from forecast df (wide format)

import numpy as np
import pandas as pd

def generate_labor_S_Matrix_from_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Build the summing matrix S from the column names of a wide-format frame."""
    # Base (bottom-level) series are the columns whose name has exactly three comma-separated parts
    base_ts_columns = df.columns[[len(s) == 3 for s in df.columns.str.split(',')]]

    S = np.empty((len(df.columns), len(base_ts_columns)))

    for i, col_name in enumerate(df.columns.to_list()):
        # Construct arrays of summing values by searching the base time series columns
        # Works for all rows but total
        searchable_terms = col_name.strip("[]").replace("'", '').split(',')
        searchable_terms = [t.strip() for t in searchable_terms]
        search_str = ".*".join(searchable_terms)
        S[i, :] = base_ts_columns.str.contains(search_str, regex=True)

    # Finally fix up the total / first row, which sums every base series
    S[0, :] = np.ones((len(base_ts_columns),))

    S_df = pd.DataFrame(data=S, columns=base_ts_columns, index=df.columns)
    return S_df
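To illustrate the matching step in the function above, here is a small self-contained sketch. The column names are assumptions: their bracketed, quoted format is inferred from the `strip("[]")` and `replace("'", '')` calls, and the two-label aggregate name is hypothetical. `membership` is an illustrative helper, not part of the library.

```python
import pandas as pd

# Assumed format of bottom-level column names (string form of a 3-item list)
base_cols = pd.Index([
    "['Employed part-time', 'Males', 'Western Australia']",
    "['Employed part-time', 'Females', 'Western Australia']",
    "['Employed part-time', 'Males', 'Australian Capital Territory']",
])


def membership(col_name: str, base_cols: pd.Index):
    """Boolean mask of the base columns that roll up into col_name."""
    terms = [t.strip() for t in col_name.strip("[]").replace("'", "").split(",")]
    # Labels must appear in this order, with anything in between
    pattern = ".*".join(terms)
    return base_cols.str.contains(pattern, regex=True)


# Hypothetical aggregate over all regions for part-time employed males
mask = membership("['Employed part-time', 'Males']", base_cols)
print(mask.tolist())
```

The match is case-sensitive, which is why 'Males' does not also select the 'Females' column. The labels are used as a regular expression without escaping, so labels containing regex metacharacters would need `re.escape` first.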

Given that the raw dataset is not actually part of the repo, what's the path towards updating this?

AzulGarza · 2023-03-13T23:46:43Z

Hey @hewsond! Thanks for letting us know about this issue. We have updated the datasets to include the summing matrix you shared. We've also added a test (#18) to ensure that S is the correct summing matrix associated with Y_df. To get the latest (right) data, remove the previously downloaded files (datasetforecast_data_dir/hierarchical). :)
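A consistency check of the kind described here could look like the sketch below. It is not the actual test added in #18: `is_consistent` is a hypothetical helper, and it assumes the series are available in wide format (one column per series, one row per timestamp) with column labels matching the index and columns of S. It rebuilds every series from the bottom-level ones via S and compares the result with the stored values.

```python
import numpy as np
import pandas as pd


def is_consistent(Y_wide: pd.DataFrame, S: pd.DataFrame, atol: float = 1e-8) -> bool:
    """True if every series in Y_wide equals S applied to the bottom-level series."""
    bottom = Y_wide[S.columns].to_numpy(dtype=float)      # shape (T, n_bottom)
    rebuilt = bottom @ S.to_numpy(dtype=float).T          # shape (T, n_all)
    actual = Y_wide[S.index].to_numpy(dtype=float)        # shape (T, n_all)
    return bool(np.allclose(rebuilt, actual, atol=atol))


# Toy data: Total = A + B at every timestamp
Y_wide = pd.DataFrame({"Total": [11.0, 22.0], "A": [1.0, 2.0], "B": [10.0, 20.0]})

S_good = pd.DataFrame(
    [[1, 1], [1, 0], [0, 1]], index=["Total", "A", "B"], columns=["A", "B"]
)
# Misaligned bottom block, similar to the problem reported in this issue
S_bad = pd.DataFrame(
    [[1, 1], [0, 1], [1, 0]], index=["Total", "A", "B"], columns=["A", "B"]
)

print(is_consistent(Y_wide, S_good))
print(is_consistent(Y_wide, S_bad))
```

If the loaded frame is in long format, it would need to be pivoted to wide format before this check applies.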

jmoralez closed this as completed Nov 30, 2023
