Support open tags. #13

AntouanK · 2019-12-30T19:49:19Z

I'm having an issue using the HN API.

The HTML it returns leaves some tags unclosed (<p>, for example).
I assume this is not "valid HTML", but all browsers support it.
Can we have an option to accept it in the parser as well?

Example:

curl 'https://hack.ernews.info/api/graphql' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  --compressed \
  -H 'Content-Type: application/graphql' \
  -H 'Origin: https://hack.ernews.info' \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://hack.ernews.info/comments-for/21914000' \
  -H 'TE: Trailers' \
  --data 'fragment itemDetails on Item { id, by, deleted, time, text, type, kids, parent, url } { myItems: itemsByIdList( idList: [21914002] ) { ...itemDetails } }'

and the value is under the text field:

      {
        "id": 21914002,
        "by": "faustomorales",
        "deleted": null,
        "time": 1577720366,
        "text": "Hi HN! I made this because I wanted a toolkit for training custom OCR models that included both text detection and recognition along with the necessary tools to create synthetic data. Existing synthetic data generators had more dependencies and set-up than I felt was absolutely necessary so I took a different tack that limited dependencies to PIL only.<p>Some use cases for this package:<p>- You can use the pretrained (trained by others!) models for OCR (see the README for an example) on English text. [0]<p>- You can fine-tune a version of the detection and recognition models on a different alphabet &#x2F; language (see the tutorial [1]).<p>- You can just use the data generator with backgrounds and fonts (I provide a packaged set of both) to create images with character-level annotations for some other model [2].<p>I&#x27;d really like to continue improving the image generator to render more realistic images while retaining the existing mix of simplicity &#x2F; flexibility. Ideas welcome!<p>[0] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;using_pretrained_models.html\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;using_pr...</a><p>[1] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_end_training.html\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_e...</a><p>[2] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_end_training.html#generating-synthetic-data\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_e...</a>",
        "type": "comment",
        "kids": null,
        "parent": 21914000,
        "url": null
      }

Devvypaws · 2020-01-02T22:10:02Z

In HTML 5, certain tags can omit their end tags and still be valid markup. In the HTML 5 spec's section on optional tags, the following elements are listed as having optional end tags:

html, head, body, li, dt, dd, p, rt, rp, optgroup, option, colgroup, caption, thead, tbody, tr, td, th, tfoot

The spec lists the rules for when each tag would be implicitly closed, so that can be a nice starting point for researching how feasible this is with the current parser's design.
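As a rough illustration of what those rules look like in practice, here is a minimal Python sketch (not this package's Elm code) that rewrites markup so that omitted end tags become explicit. The `CLOSED_BY` table, the `VOID` set, and the `close_optional_tags` function are hypothetical names, and the table encodes only a subset of the spec's rules. The regex tokenizer is a simplification that would break on attribute values containing `>`.

```python
import re

# Hypothetical, partial rule table: for each element whose end tag is
# optional, the start tags that implicitly close it. Only a subset of the
# spec's rules is encoded here.
CLOSED_BY = {
    "p": {"p", "div", "ul", "ol", "pre", "table", "blockquote", "hr",
          "h1", "h2", "h3", "h4", "h5", "h6"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "option": {"option", "optgroup"},
    "tr": {"tr"},
    "td": {"td", "th"},
    "th": {"td", "th"},
}

# A few void elements: they never take an end tag, so they are never pushed.
VOID = {"br", "hr", "img", "input", "meta", "link"}

# Simplified tag matcher (assumes no ">" inside attribute values).
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def close_optional_tags(html):
    """Return html with end tags inserted where the rule table implies them."""
    out = []
    stack = []  # names of currently open elements, innermost last
    pos = 0
    for match in TAG.finditer(html):
        out.append(html[pos:match.start()])  # text before this tag
        pos = match.end()
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()
        if is_end_tag:
            if name in stack:
                # Close everything still open inside the element being closed.
                while stack:
                    top = stack.pop()
                    if top == name:
                        break
                    out.append(f"</{top}>")
            out.append(match.group(0))
        else:
            # A start tag may implicitly close the innermost open element.
            if stack and name in CLOSED_BY.get(stack[-1], ()):
                out.append(f"</{stack.pop()}>")
            out.append(match.group(0))
            self_closing = match.group(3).rstrip().endswith("/")
            if name not in VOID and not self_closing:
                stack.append(name)
    out.append(html[pos:])
    # End of input closes whatever is still open.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

Applied to a shortened version of the HN comment above, `close_optional_tags('Hi<p>Some<p>[0] <a href="x">y</a>')` closes each paragraph before the next one starts and closes the last one at the end of the input. A real implementation inside the parser would track the same "open elements" stack while building nodes rather than rewriting the string.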

AntouanK · 2020-01-03T08:23:52Z

I ended up using cheerio in my back-end to "sanitize" all the HTML that comes from the HN API. It was a quick workaround, because the issue was blocking my UI entirely.

Still, since this package is basically the only way to render an HTML string in Elm, I think we should consider this ticket.

@hecrj let us know what you think.

hecrj · 2020-01-04T17:05:34Z

The parser must definitely support the whole spec. Therefore, this is missing functionality.

However, I don't think I will be able to invest time in this soon. If anyone else is willing to give it a shot, by all means go for it! I will gladly review any efforts.

I think a good starting point, before starting with the implementation, would be to set up an exhaustive test suite. We seem to be dealing with a bunch of different cases, each one with its own rules.
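One way to picture such a suite is as a table of cases, one per rule. The sketch below uses Python purely for illustration (the package's real tests would be written in Elm); `CASES` and `failing_cases` are hypothetical names. Each case pairs markup that omits end tags with the equivalent markup that writes them all out.

```python
# Hypothetical test table: (description, input with omitted end tags,
# equivalent markup with every end tag written out).
CASES = [
    ("p closed by a following p",
     "<p>a<p>b",
     "<p>a</p><p>b</p>"),
    ("p closed by a following div",
     "<p>a<div>b</div>",
     "<p>a</p><div>b</div>"),
    ("li closed by a following li and by the end of its parent",
     "<ul><li>a<li>b</ul>",
     "<ul><li>a</li><li>b</li></ul>"),
    ("dt closed by a following dd",
     "<dl><dt>t<dd>d</dl>",
     "<dl><dt>t</dt><dd>d</dd></dl>"),
    ("option closed by a following option",
     "<select><option>a<option>b</select>",
     "<select><option>a</option><option>b</option></select>"),
    ("td closed by a following td and by the end of its row",
     "<table><tr><td>a<td>b</tr></table>",
     "<table><tr><td>a</td><td>b</td></tr></table>"),
]


def failing_cases(normalize):
    """Return the descriptions of cases where normalize(input) != expected."""
    return [
        description
        for description, source, expected in CASES
        if normalize(source) != expected
    ]
```

Any candidate implementation can be passed in as `normalize`. An empty result means every listed rule is handled, and the returned descriptions name the rules that are not.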

EarthenSky · 2021-07-04T19:54:57Z

> html, head, body, li, dt, dd, p, rt, rp, optgroup, option, colgroup, caption, thead, tbody, tr, td, th, tfoot

In addition, I was having problems with unclosed <a> tags.

hecrj added the "enhancement" and "help wanted" labels · Jan 4, 2020
