Implement scraper for PA register #10

MthwRobinson · 2020-12-02T02:49:42Z

The goal of this issue is to implement the code for downloading regulations from the Pennsylvania register and converting the output to XML. For now, you can store the resulting XML files in a local directory. The directory should use the following convention: pa/{volume}/{number}/{regulation_number}.xml.
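As a rough illustration of the directory convention, here is a minimal sketch of a path helper. The function name `regulation_path`, the `root` argument, and the sample values are assumptions made for this example and are not part of the existing code base.

```python
from pathlib import Path


def regulation_path(root, volume, number, regulation_number):
    """Build <root>/pa/<volume>/<number>/<regulation_number>.xml.

    Hypothetical helper: only constructs the path. The caller is expected to
    create the parent directory before writing, for example with
    path.parent.mkdir(parents=True, exist_ok=True).
    """
    return Path(root) / "pa" / str(volume) / str(number) / f"{regulation_number}.xml"


# Sample values are made up for illustration only.
example = regulation_path("data", 50, 49, "20-1684")
```

Keeping path construction separate from file writing makes it easy to point the output somewhere other than a local directory later.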

The Pennsylvania register has the following disclaimer. Since we're not reselling the data or using it for profit, we should be good to go.

No part of the information on this site may be reused for profit or sold for profit.

kvnkho · 2020-12-03T22:50:10Z

Does the parser really need to produce XML directly, or can we write JSON and then convert it?

MthwRobinson · 2020-12-05T15:36:16Z

The reasons for using XML are:

A lot of the regulations (like this one) have strikethroughs and other markup formatting that would be useful to maintain.
I'd like to enforce use of a common schema, which is a little easier in XML.
I find XML to be more human readable.
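To make the first point concrete, here is a small sketch of how inline formatting such as a strikethrough can sit inside a text field in XML. The tag names (`body`, `strike`, `u`) and the sample sentence are assumptions for illustration, not the project's schema.

```python
import xml.etree.ElementTree as ET

# Mixed content: plain text interleaved with inline formatting elements.
body = ET.Element("body")
body.text = "The fee is "

struck = ET.SubElement(body, "strike")  # text removed by the amendment
struck.text = "$10"
struck.tail = " "

added = ET.SubElement(body, "u")  # text added by the amendment
added.text = "$15"
added.tail = " per filing."

markup = ET.tostring(body, encoding="unicode")
# markup == '<body>The fee is <strike>$10</strike> <u>$15</u> per filing.</body>'

# The plain text is still recoverable when the formatting is not needed.
plain_text = "".join(body.itertext())
```

A JSON string field would need an ad hoc encoding to carry the same information.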

That said, I am planning on implementing a helper function that converts a dict to XML, so if you can get all the data into a dictionary, you can just call that.
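For anyone who wants a feel for what such a helper might look like, below is a minimal sketch using only the standard library. The name `dict_to_xml`, the handling of lists and `None` values, and the sample dictionary are assumptions for this example. The actual helper in the repository may work differently.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(tag, data):
    """Recursively convert a dict into an ElementTree element (illustrative only)."""
    element = ET.Element(tag)
    for key, value in data.items():
        if value is None:
            continue  # assumption: skip missing fields rather than emit empty tags
        if isinstance(value, dict):
            element.append(dict_to_xml(key, value))
        elif isinstance(value, (list, tuple)):
            # assumption: repeat the key as a tag for each list item
            for item in value:
                if isinstance(item, dict):
                    element.append(dict_to_xml(key, item))
                else:
                    ET.SubElement(element, key).text = str(item)
        else:
            ET.SubElement(element, key).text = str(value)
    return element


# Made-up sample data, not scraped from any register.
sample = {
    "title": "Example regulation",
    "body": "Section 1. Text",
    "contacts": [{"first_name": "Jane", "last_name": "Doe"}],
}
xml_string = ET.tostring(dict_to_xml("regulation", sample), encoding="unicode")
```

With something like this in place, a scraper only has to fill in a dictionary and the serialization stays in one spot.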

kvnkho · 2020-12-05T15:40:34Z

Got it, that makes sense. Yeah, I was wondering whether the VA code was going to be rewritten or just converted. If converted, that helper function would be useful across the registers.

MthwRobinson · 2020-12-05T15:42:29Z

Yeah, I'm planning to make the changes to the VA code today. I can link that here so people can use it as an example.

MthwRobinson · 2020-12-06T21:28:34Z

The Regulation data model now has a to_xml method that you can call. If you can turn your output into a Regulation object, you should be good to go. Here's an example of what it looks like for Virginia.

data-ingest/data_ingest/regs/va.py

Lines 59 to 61 in e72c775

    filename = os.path.join(issue_dir, f"{issue_id}.xml")
    regulation = normalize_regulation(get_regulation(issue_id))
    regulation.to_xml(filename)

data-ingest/data_ingest/regs/va.py

Lines 113 to 154 in e72c775

    def normalize_regulation(regulation):
        """Normalizes the regulation dictionary and converts it to the standardized
        Regulation object.

        Parameters
        ----------
        regulation : dict
            A dictionary representing the information scraped from the registry site.

        Returns
        -------
        normalized_regulation : Regulation
            A base Regulation object
        """
        normalized_reg = dict()

        # Concatenate each subsection into a single body string.
        body = str()
        for subtitle, content in regulation["content"].items():
            body += f"{subtitle}. {content['description']}\n{content['text']}\n"
        normalized_reg["body"] = body if body else None

        # Convert the raw contact entry, if present, into a Contact object.
        contact = regulation.get("contact", None)
        if contact:
            first_name, last_name, email, phone = _parse_contact(contact)
            contact = Contact.from_dict(
                {
                    "first_name": first_name,
                    "last_name": last_name,
                    "email": email,
                    "phone": phone,
                }
            )
            normalized_reg["contacts"] = [contact]

        normalized_reg["effective_date"] = extract_date(regulation["effective_date"])
        normalized_reg["register_date"] = extract_date(regulation["register_date"])

        # Copy over any remaining Regulation fields that were scraped as-is.
        for key in vars(Regulation()):
            if key not in normalized_reg and key in regulation:
                normalized_reg[key] = regulation[key]

        return Regulation.from_dict(normalized_reg)

MthwRobinson added the feature (New feature or request) and new_state (Adding a new state to the project) labels Dec 2, 2020

kvnkho assigned kvnkho and wbcai and unassigned kvnkho Dec 3, 2020

MthwRobinson changed the title ~~Implement parsing code for PA register~~ Implement scraper for PA register Dec 5, 2020
