🔧 [Refactor] Raw data standard format #37

Open
ulfgebhardt opened this issue Apr 2, 2021 · 1 comment

@ulfgebhardt
Member

⚡ Refactor ticket

We should store the raw data (JSON) in a standard format. I propose the format generated by the scraper tool https://github.com/bundestag/scapacra-bt. All of its data objects share the following structure:

{
  "meta": {
    "url": "https://www.bundestag.de/abgeordnete/biografien/A/517818-517818",
    ...
  },
  "data": {
    "id": "517818",
    ...
  }
}

So basically the object is split into a meta part and a data part. The meta part holds at least a url field, while the data part defines at least an id field.
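
To illustrate the rule above, here is a minimal sketch of a check that a record follows the proposed envelope. It assumes only what is stated here (a meta object with a url field and a data object with an id field); the function name validate_record is hypothetical and is not part of scapacra-bt.

```python
import json
from typing import Any


def validate_record(record: Any) -> list[str]:
    """Return a list of problems; an empty list means the record fits the envelope.

    Hypothetical helper: checks only the two requirements named in this issue,
    i.e. meta.url and data.id must be present.
    """
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    problems: list[str] = []
    for part, required in (("meta", "url"), ("data", "id")):
        section = record.get(part)
        if not isinstance(section, dict):
            problems.append(f"missing or non-object '{part}' part")
        elif required not in section:
            problems.append(f"'{part}' lacks required field '{required}'")
    return problems


# Smallest record that satisfies the proposal, using the values from the example above.
raw = """
{
  "meta": {"url": "https://www.bundestag.de/abgeordnete/biografien/A/517818-517818"},
  "data": {"id": "517818"}
}
"""
record = json.loads(raw)
print(validate_record(record))        # []
print(validate_record({"data": {}}))  # reports the missing meta part and the missing id
```

Any additional fields inside meta or data are ignored by this check, which matches the "at least" wording of the proposal.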

Motive

We want to unify the raw data formats so they are easy to understand and parse. For example, you might want to match laws with deputies and named polls. A shared structure across data sets should make that kind of cross-referencing easier.

Additional context

Example: https://github.com/bundestag/DeputyProfiles/blob/master/data/517818.json

@darkdragon-001
Collaborator

I am not convinced by this proposal, for the following reasons (based mostly on my experience with the BGBl scraper, which parses a table-of-contents tree):

  • url: There are different URLs: web page, pdf, json data
  • id: There are different IDs: toc id, doc id, ...
  • meta vs data: I don't see what value this extra layer of indirection provides

Furthermore, different data needs to be handled differently anyway, and anyone working with the data has to work out carefully what each field means. Properties with overly generic names often lead to false assumptions when interpreting the data.

Nevertheless, the output data structure should be documented in the README.md.
