Support open tags. #13

Open
AntouanK opened this issue Dec 30, 2019 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@AntouanK

AntouanK commented Dec 30, 2019

I'm having an issue parsing the HTML returned by the HN API.

The HTML it sends leaves some tags unclosed (<p>, for example).
I assume this is not "valid HTML", but all browsers accept it.
Could the parser have an option to accept it as well?

Example:

curl 'https://hack.ernews.info/api/graphql' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Content-Type: application/graphql' -H 'Origin: https://hack.ernews.info' -H 'DNT: 1' -H 'Connection: keep-alive' -H 'Referer: https://hack.ernews.info/comments-for/21914000' -H 'TE: Trailers' --data 'fragment itemDetails on Item { id, by, deleted, time, text, type, kids, parent, url } { myItems: itemsByIdList( idList: [21914002] ) { ...itemDetails } }'

The HTML in question is in the text field of the response:

      {
        "id": 21914002,
        "by": "faustomorales",
        "deleted": null,
        "time": 1577720366,
        "text": "Hi HN! I made this because I wanted a toolkit for training custom OCR models that included both text detection and recognition along with the necessary tools to create synthetic data. Existing synthetic data generators had more dependencies and set-up than I felt was absolutely necessary so I took a different tack that limited dependencies to PIL only.<p>Some use cases for this package:<p>- You can use the pretrained (trained by others!) models for OCR (see the README for an example) on English text. [0]<p>- You can fine-tune a version of the detection and recognition models on a different alphabet &#x2F; language (see the tutorial [1]).<p>- You can just use the data generator with backgrounds and fonts (I provide a packaged set of both) to create images with character-level annotations for some other model [2].<p>I&#x27;d really like to continue improving the image generator to render more realistic images while retaining the existing mix of simplicity &#x2F; flexibility. Ideas welcome!<p>[0] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;using_pretrained_models.html\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;using_pr...</a><p>[1] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_end_training.html\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_e...</a><p>[2] <a href=\"https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_end_training.html#generating-synthetic-data\" rel=\"nofollow\">https:&#x2F;&#x2F;keras-ocr.readthedocs.io&#x2F;en&#x2F;latest&#x2F;examples&#x2F;end_to_e...</a>",
        "type": "comment",
        "kids": null,
        "parent": 21914000,
        "url": null
      }
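
To see which tags a fragment like the one above leaves open, a small helper can count start tags against end tags. This is only a sketch using Python's standard `html.parser`; the names `TagCounter` and `unclosed_tags` and the shortened sample string are made up for illustration and are not part of this package.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not counted.
VOID = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                  "input", "link", "meta", "source", "track", "wbr"})


class TagCounter(HTMLParser):
    """Tracks (start tags - end tags) per element name."""

    def __init__(self):
        super().__init__()
        self.balance = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.balance[tag] += 1

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.balance[tag] -= 1


def unclosed_tags(fragment):
    """Return {tag: count} for tags opened more often than closed."""
    counter = TagCounter()
    counter.feed(fragment)
    counter.close()
    return {tag: n for tag, n in counter.balance.items() if n > 0}


# Shortened, hypothetical stand-in for the "text" value above.
sample = ('Hi HN!<p>Some use cases:<p>[0] '
          '<a href="https://example.org" rel="nofollow">example</a>')
print(unclosed_tags(sample))
```

On this sample only `p` is reported, since the `<a>` element is closed explicitly.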
@Devvypaws
Contributor

In HTML 5, certain tags can omit their end tags and still be valid markup. In the HTML 5 spec's section on optional tags, the following elements are listed as having optional end tags:

html, head, body, li, dt, dd, p, rt, rp, optgroup, option, colgroup, caption, thead, tbody, tr, td, th, tfoot

The spec lists the rules for when each tag would be implicitly closed, so that can be a nice starting point for researching how feasible this is with the current parser's design.
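
As a rough illustration of the implicit-close idea, the sketch below rewrites a fragment so that omitted end tags become explicit. It is written in Python on top of the standard `html.parser`, not against this package's parser, and `IMPLIED_END` covers only a hand-picked subset of the spec's rules. It also makes simplifying assumptions: only the innermost open element is checked, and comments and doctypes are dropped. The names `EndTagInserter` and `add_end_tags` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Open element -> start tags that imply its end (a subset of the spec's rules).
IMPLIED_END = {
    "p": {"p", "div", "ul", "ol", "dl", "pre", "table", "blockquote", "hr",
          "h1", "h2", "h3", "h4", "h5", "h6"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "option": {"option", "optgroup"},
}

# Void elements have no end tag and are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class EndTagInserter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []    # pieces of the rewritten markup
        self.stack = []  # names of currently open elements

    def _close_top(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        # Close the innermost element if this start tag implies its end.
        if self.stack and tag in IMPLIED_END.get(self.stack[-1], ()):
            self._close_top()
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything still open inside the element being closed.
        while self.stack:
            closed = self.stack[-1]
            self._close_top()
            if closed == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text.
        self.out.append(escape(data, quote=False))

    def result(self):
        # End of input closes whatever is still open.
        while self.stack:
            self._close_top()
        return "".join(self.out)


def add_end_tags(fragment):
    parser = EndTagInserter()
    parser.feed(fragment)
    parser.close()
    return parser.result()


print(add_end_tags("Hi<p>One<p>Two"))
```

A table keyed by the open element keeps each rule local, which may make it easier to grow toward the full list in the spec one element at a time.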

@AntouanK
Author

AntouanK commented Jan 3, 2020

I ended up using cheerio in my back-end to "sanitize" all the HTML that comes from the HN API.
It was a quick fix, because this was blocking my UI entirely.

But still, since this package is basically the only way to render an HTML string in Elm, I think we should consider this ticket.

@hecrj let us know what you think.

hecrj added the enhancement and help wanted labels on Jan 4, 2020
@hecrj
Owner

hecrj commented Jan 4, 2020

The parser must definitely support the whole spec. Therefore, this is missing functionality.

However, I don't think I will be able to invest time in this soon. If anyone else is willing to give it a shot, by all means go for it! I will gladly review any efforts.

I think a good starting point, before starting with the implementation, would be to set up an exhaustive test suite. We seem to be dealing with a bunch of different cases, each one with its own rules.
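
One possible shape for such a suite is a table of (name, input, expected) cases run against whatever function is under test. The sketch below is in Python purely for illustration; the case names, the `run_cases` helper, and the convention of comparing against markup with explicit end tags are all assumptions, not something this package provides.

```python
# Each case: (name, input with omitted end tags, equivalent explicit markup).
CASES = [
    ("already closed", "<p>a</p>", "<p>a</p>"),
    ("p followed by p", "<p>a<p>b", "<p>a</p><p>b</p>"),
    ("li followed by li", "<ul><li>a<li>b</ul>",
     "<ul><li>a</li><li>b</li></ul>"),
    ("dt followed by dd", "<dl><dt>t<dd>d</dl>",
     "<dl><dt>t</dt><dd>d</dd></dl>"),
    ("option followed by option",
     "<select><option>a<option>b</select>",
     "<select><option>a</option><option>b</option></select>"),
]


def run_cases(normalize, cases=CASES):
    """Return the names of the cases where normalize(input) != expected."""
    return [name for name, source, expected in cases
            if normalize(source) != expected]


# An identity function stands in for a parser without open-tag support:
# only the already-closed case passes.
print(run_cases(lambda source: source))
```

Keeping one row per rule makes it easy to see which parts of the optional-tags section are covered and which are still missing.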

@EarthenSky

html, head, body, li, dt, dd, p, rt, rp, optgroup, option, colgroup, caption, thead, tbody, tr, td, th, tfoot

In addition, I was having problems with unclosed <a> tags.
