Implement scraper for PA register #10

Open
MthwRobinson opened this issue Dec 2, 2020 · 5 comments
Labels
feature New feature or request new_state Adding a new state to the project

Comments

@MthwRobinson
Contributor

The goal of this issue is to implement the code for downloading regulations from the Pennsylvania register and converting the output to XML. For now, you can store the resulting XML files in a local directory. The directory should use the following convention: pa/{volume}/{number}/{regulation_number}.xml.
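A minimal sketch of how the directory convention above could be built in Python. The function name `regulation_path`, the `base_dir` parameter, and the sample values are hypothetical and are not part of the project.

```python
from pathlib import Path


def regulation_path(base_dir, volume, number, regulation_number):
    """Build <base_dir>/pa/{volume}/{number}/{regulation_number}.xml.

    Hypothetical helper: only the pa/{volume}/{number}/{regulation_number}.xml
    layout comes from the issue; everything else is an assumption.
    """
    return (
        Path(base_dir)
        / "pa"
        / str(volume)
        / str(number)
        / f"{regulation_number}.xml"
    )


# Made-up values, used only to show the resulting layout.
path = regulation_path("data", 50, 49, "12-106")
print(path.as_posix())  # data/pa/50/49/12-106.xml
```

Before writing a file, the parent directories can be created with `path.parent.mkdir(parents=True, exist_ok=True)`.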

The Pennsylvania register has the following disclaimer. Since we're not reselling the data or using it for profit, we should be good to go.

No part of the information on this site may be reused for profit or sold for profit.

@MthwRobinson MthwRobinson added feature New feature or request new_state Adding a new state to the project labels Dec 2, 2020
@kvnkho
Contributor

kvnkho commented Dec 3, 2020

Does the parser really need to handle XML, or can we do it in JSON and then convert?

@kvnkho kvnkho assigned kvnkho and wbcai and unassigned kvnkho Dec 3, 2020
@MthwRobinson
Contributor Author

The reason for the XML is:

  1. A lot of the regulations (like this one) have strikethroughs and other markdown formatting that would be useful to maintain.
  2. I'd like to enforce use of a common schema, which is a little easier in XML.
  3. I find XML to be more human readable.

That said, I am planning on implementing a helper function that converts a dict to XML, so if you can get all the data into a dictionary, you can just call that.
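For illustration only, a dict-to-XML helper of the kind described above might look roughly like this, using the standard library's `xml.etree.ElementTree`. The names `dict_to_xml` and `_append`, the `regulation` root tag, and the `item` tag used for list entries are all assumptions; this is not the helper that ended up in the project.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(data, root_tag="regulation"):
    """Convert a nested dict into an ElementTree element (hypothetical sketch)."""
    root = ET.Element(root_tag)
    _append(root, data)
    return root


def _append(parent, value):
    # Dict keys become child tags, so they must be valid XML names.
    if isinstance(value, dict):
        for key, item in value.items():
            _append(ET.SubElement(parent, key), item)
    # List entries are wrapped in a generic <item> tag (an assumption).
    elif isinstance(value, (list, tuple)):
        for item in value:
            _append(ET.SubElement(parent, "item"), item)
    # Scalars become the element text; None leaves the element empty.
    elif value is not None:
        parent.text = str(value)


scraped = {"title": "Example", "contacts": [{"email": "a@example.com"}]}
print(ET.tostring(dict_to_xml(scraped), encoding="unicode"))
# <regulation><title>Example</title><contacts><item><email>a@example.com</email></item></contacts></regulation>
```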

@kvnkho
Contributor

kvnkho commented Dec 5, 2020

Got it, that makes sense. Yeah, I was wondering whether the VA code was going to be rewritten or just converted. If converted, that helper function would be useful across the registers.

@MthwRobinson MthwRobinson changed the title Implement parsing code for PA register Implement scraper for PA register Dec 5, 2020
@MthwRobinson
Contributor Author

Yeah, I'm planning to make the changes to the VA code today. I can link that in here so people can use it as an example.

@MthwRobinson
Contributor Author

The Regulation data model now has a to_xml method that you can call. If you can turn your output into a Regulation object, you should be good to go. Here's an example of what it looks like for Virginia.

filename = os.path.join(issue_dir, f"{issue_id}.xml")
regulation = normalize_regulation(get_regulation(issue_id))
regulation.to_xml(filename)

def normalize_regulation(regulation):
    """Normalizes the regulation dictionary and converts it to the standardized
    Regulation object.

    Parameters
    ----------
    regulation : dict
        A dictionary representing the information scraped from the registry site.

    Returns
    -------
    normalized_regulation : Regulation
        A base Regulation object
    """
    normalized_reg = dict()

    # Concatenate every subsection into a single body string.
    body = str()
    for subtitle, content in regulation["content"].items():
        body += f"{subtitle}. {content['description']}\n{content['text']}\n"
    normalized_reg["body"] = body if body else None

    # Convert the raw contact entry, if present, into a Contact object.
    contact = regulation.get("contact", None)
    if contact:
        first_name, last_name, email, phone = _parse_contact(contact)
        contact = Contact.from_dict(
            {
                "first_name": first_name,
                "last_name": last_name,
                "email": email,
                "phone": phone,
            }
        )
        normalized_reg["contacts"] = [contact]

    normalized_reg["effective_date"] = extract_date(regulation["effective_date"])
    normalized_reg["register_date"] = extract_date(regulation["register_date"])

    # Copy over any remaining Regulation fields that need no normalization.
    for key in vars(Regulation()):
        if key not in normalized_reg and key in regulation:
            normalized_reg[key] = regulation[key]

    return Regulation.from_dict(normalized_reg)
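
To make the final copy loop and the `from_dict` / `to_xml` round trip easier to follow, here is a rough, self-contained sketch. `SketchRegulation`, its three fields, `to_xml_string`, and `copy_known_fields` are hypothetical stand-ins; the project's actual `Regulation` class may be defined quite differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchRegulation:
    """Hypothetical stand-in for the project's Regulation model."""

    title: Optional[str] = None
    body: Optional[str] = None
    effective_date: Optional[str] = None

    @classmethod
    def from_dict(cls, data):
        # Keep only the keys that the model defines.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in names})

    def to_xml_string(self):
        # Emit one child element per populated field.
        root = ET.Element("regulation")
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None:
                ET.SubElement(root, f.name).text = str(value)
        return ET.tostring(root, encoding="unicode")


def copy_known_fields(scraped, normalized):
    """Copy model fields from the scraped dict unless already normalized."""
    for key in vars(SketchRegulation()):
        if key not in normalized and key in scraped:
            normalized[key] = scraped[key]
    return normalized


scraped = {"title": "Sample rule", "body": "raw", "source_url": "x"}
merged = copy_known_fields(scraped, {"body": "Cleaned body"})
print(SketchRegulation.from_dict(merged).to_xml_string())
# <regulation><title>Sample rule</title><body>Cleaned body</body></regulation>
```

The already normalized `body` wins over the raw scraped value, and keys the model does not define (here `source_url`) are dropped.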

3 participants