Add support to multi-row. #336

Open
wants to merge 1 commit into
base: master

juniorz
Contributor

@juniorz juniorz commented Feb 6, 2017

Multi-row lets a result inherit fields from a previous HTML element (<tr>) that contains information about subsequent elements (<td>). This is common in trackers that display items in a "grouped" layout.

This feature extends (and replaces) the "dateheaders" feature, allowing more fields to be inherited.

Compare the tests TestIndexerDefinitionRunner_MultiRowSearch and
TestIndexerDefinitionRunner_DateHeadersSearch to see how to replace the
"dateheaders".

@kaso17
Collaborator

kaso17 commented Feb 6, 2017

Thank you for your PR.
Do you have some real-world definitions for trackers using this new feature?
I can see how this will work with traditional Gazelle-based trackers that have only grouped results.
But what would happen if we have mixed results (grouped and ungrouped)? Wouldn't it use the wrong (previous) group header for ungrouped results?
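
To make the concern concrete, here is a minimal Python sketch of naive "carry the last header forward" inheritance. It is not the PR's implementation: the row dictionaries, the `kind` values, and the `inherit_naively` helper are all hypothetical and only illustrate how an ungrouped row could pick up a stale group header.

```python
def inherit_naively(rows):
    """Attach the most recently seen header title to every later result row."""
    last_header = None
    items = []
    for row in rows:
        if row["kind"] == "header":
            last_header = row["title"]
        else:
            # Assumption: any non-header row inherits, grouped or not.
            prefix = f"{last_header} " if last_header else ""
            items.append(prefix + row["title"])
    return items


rows = [
    {"kind": "header", "title": "My TV Show Name"},
    {"kind": "grouped", "title": "S01E01"},
    {"kind": "standalone", "title": "Some Standalone Release"},
]
items = inherit_naively(rows)
```

In this sketch the standalone row ends up prefixed with a header it does not belong to, which is the failure mode the question is about.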

@kaso17
Collaborator

kaso17 commented Feb 6, 2017

My idea for implementing support for groups was to introduce row types.
This is an example of how a definition would look:

  search:
    path: /torrents.php
    inputs:
      ...
    rows:
      selector: table#torrents_table_classic > tbody > tr # match all rows
      types:
        group_header: # name of the group
          selector: .torrent_group_header # only process this group if the selector matches
          result: false # internal row, won't produce an "item"
          fields:
            group_title: # selectorBlock
              selector: td:nth-child(1)
            group_year: # selectorBlock
              selector: td:nth-child(2)

        group_child_torrent:
          selector: .group_torrent
          result: true # this row will result in a torrent result/"item"
          fields:
            title:
              selector: td:nth-child(1)
              filters:
                - name: prepend
                  args: "{{ .group_header.group_title }} [{{ .group_header.group_year }}] " # access variables from the last row match
            details:
              selector: a
            ...
              
        standalone_torrent:
          selector: .torrent
          result: true
          fields:
            title:
              selector: td:nth-child(1)
            details:
              selector: a
            ...

That would make parsing much more flexible.
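
A rough Python sketch of how such row types could be dispatched, under some assumptions: rows are reduced to a CSS class plus a list of cell texts, selectors are replaced by plain class comparisons, and the `prepend` filter with its template argument is replaced by an f-string. The names `ROW_TYPES`, `parse_rows`, and the `*_fields` helpers are hypothetical and not part of the project.

```python
def header_fields(cells, variables):
    return {"group_title": cells[0], "group_year": cells[1]}


def child_fields(cells, variables):
    # Stand-in for the "prepend" filter reading the last group_header match.
    header = variables["group_header"]
    return {"title": f"{header['group_title']} [{header['group_year']}] {cells[0]}"}


def standalone_fields(cells, variables):
    return {"title": cells[0]}


# (type name, CSS class standing in for the selector, result flag, extractor)
ROW_TYPES = [
    ("group_header", "torrent_group_header", False, header_fields),
    ("group_child_torrent", "group_torrent", True, child_fields),
    ("standalone_torrent", "torrent", True, standalone_fields),
]


def parse_rows(rows):
    variables = {}  # type name -> fields of the last row matched by that type
    items = []
    for css_class, cells in rows:
        for name, wanted_class, is_result, extract in ROW_TYPES:
            if css_class != wanted_class:
                continue
            fields = extract(cells, variables)
            variables[name] = fields
            if is_result:  # only result: true types produce an item
                items.append(fields)
            break
    return items


rows = [
    ("torrent_group_header", ["Some Album", "2016"]),
    ("group_torrent", ["FLAC"]),
    ("torrent", ["Standalone Release"]),
    ("group_torrent", ["MP3 320"]),
]
items = parse_rows(rows)
```

The standalone row never reads `group_header`, so it is unaffected by the preceding group. A child row that appears before any header would raise a `KeyError` in this sketch; a real implementation would have to decide how to handle that case.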

@juniorz
Contributor Author

juniorz commented Feb 6, 2017

It would be great to be able to store "queries" in variables and use them later. Let me see if I understand the proposal correctly: for my use case, the group is usually the first row that matches a specific selector before the result row. Example:

<table>
  <tr class="group">
    <td class="category">TV Shows</td>
    <td class="name" colspan="3">My TV Show Name</td>
  </tr>
  <tr class="result">
    <td>&nbsp;</td>
    <td class="name">S01E01</td>
    <td class="leecheers-and-seeders">10/25</td>
    <td><a href="#">Download</a></td>
  </tr>
  <tr class="result">
    <td>&nbsp;</td>
    <td class="name">S01E02</td>
    <td class="leecheers-and-seeders">20/40</td>
    <td><a href="#">Download</a></td>
  </tr>
  <tr class="group">
    <td class="category">Movies</td>
    <td class="name" colspan="3">Awesome blockbuster</td>
  </tr>
  <tr class="result">
    <td>&nbsp;</td>
    <td class="name">720p</td>
    <td class="leecheers-and-seeders">7/15</td>
    <td><a href="#">Download</a></td>
  </tr>
  <tr class="result">
    <td>&nbsp;</td>
    <td class="name">1080p</td>
    <td class="leecheers-and-seeders">50/70</td>
    <td><a href="#">Download</a></td>
  </tr>
</table>

If rows.selector matches all rows, how are you planning to filter out what should be considered the generator for the result list? Are you planning to use every rows.types.* entry without result=false as a generator for the result list?

I believe this should work, but how are you planning to handle subsequent results in the same group? Are you planning to keep every rows.types.* entry with result=false available for substitution (as you did in group_child_torrent) until it matches again?

@juniorz
Contributor Author

juniorz commented Feb 6, 2017

I have used this multi-row strategy to implement grouped results in the BJ Share tracker (https://bj-share.me). By skipping the group search (rather than making it an error) when it can't be found, I managed to get grouped and ungrouped results in the same definition.

See: https://gist.github.com/juniorz/e3d2492f91603e4c392dd551d931aaa8#file-bjshare-multi-row-yml

@kaso17
Collaborator

kaso17 commented Feb 6, 2017

    rows:
      selector: table > tr # match all rows
      types:
        group:
          selector: tr.group
          result: false
          fields:
            name:
              selector: td.name
            category:
              selector: td.category
        
        result:
          selector: tr.result
          result: true
          fields:
            title:
              selector: td.name
              filters:
                - name: prepend
                  args: "{{ .group.name }} "
            category:
              text: "{{ .group.category }}"
            seeders: 
              selector: td.leecheers-and-seeders
              filters:
                - name: split
                  args: ["/", 0]
            leechers: 
              selector: td.leecheers-and-seeders
              filters:
                - name: split
                  args: ["/", 1]
            download:
              selector: a

This would be an example definition for your example HTML.

Only types with result=true will generate new items for the result list.

Whenever a row matching a type is parsed, it would update the variable .$TYPE_NAME.$FIELD. The variables can be accessed until they're overwritten by the same type again.
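
The variable lifetime described here can be sketched in Python against a shortened copy of the example HTML. This is an illustration under assumptions, not the project's code: `RowCollector` and `run_definition` are hypothetical names, the selectors are hard-coded as class lookups, and the `split` filter is approximated with `str.split`.

```python
from html.parser import HTMLParser

SAMPLE = """
<table>
  <tr class="group"><td class="category">TV Shows</td><td class="name">My TV Show Name</td></tr>
  <tr class="result"><td class="name">S01E01</td><td class="leecheers-and-seeders">10/25</td></tr>
  <tr class="group"><td class="category">Movies</td><td class="name">Awesome blockbuster</td></tr>
  <tr class="result"><td class="name">720p</td><td class="leecheers-and-seeders">7/15</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect each <tr> as (row class, {cell class: cell text})."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell_class = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "tr":
            self.rows.append((css_class, {}))
        elif tag == "td":
            self.cell_class = css_class

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell_class = None

    def handle_data(self, data):
        # Only text inside a classed <td> is recorded.
        if self.cell_class and self.rows:
            cells = self.rows[-1][1]
            cells[self.cell_class] = cells.get(self.cell_class, "") + data.strip()


def run_definition(rows):
    variables = {}  # ".group.name" corresponds to variables["group"]["name"]
    items = []
    for row_class, cells in rows:
        if row_class == "group":  # result: false, only updates the variables
            variables["group"] = {"name": cells["name"], "category": cells["category"]}
        elif row_class == "result":  # result: true, emits an item
            counts = cells["leecheers-and-seeders"].split("/")
            items.append({
                "title": variables["group"]["name"] + " " + cells["name"],
                "category": variables["group"]["category"],
                "seeders": counts[0],   # split args ["/", 0]
                "leechers": counts[1],  # split args ["/", 1]
            })
    return items


collector = RowCollector()
collector.feed(SAMPLE)
items = run_definition(collector.rows)
```

Each `group` row overwrites `variables["group"]`, so the second result row picks up the "Movies" header rather than the first one.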

@kaso17
Collaborator

kaso17 commented Feb 6, 2017

I tried your implementation and as expected it doesn't work with standalone torrents:
https://nimbus.everhelper.me/client/notes/share/755139/35iiicf1ouk7x547knc5

@kaso17
Collaborator

kaso17 commented Feb 6, 2017

A small addition to my previous suggestion: we could make the "result" field (is there a better name for it?) a string instead of a boolean.

  • none: doesn't generate a result item
  • new: generates a new result item
  • last: adds to the current/last result item

With last we could get rid of the after statement too.
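
A minimal Python sketch of the three modes, assuming rows have already been matched to a type and reduced to a dictionary of fields. The `collect` helper and the `torrent_details` type name are hypothetical and only serve the illustration.

```python
def collect(rows, modes):
    """Apply a per-type result mode: "none", "new" or "last"."""
    items = []
    for row_type, fields in rows:
        mode = modes[row_type]
        if mode == "new":
            items.append(dict(fields))
        elif mode == "last":
            if not items:
                raise ValueError(f"{row_type!r} row has no previous item to extend")
            items[-1].update(fields)
        elif mode != "none":
            raise ValueError(f"unknown result mode: {mode!r}")
    return items


modes = {"group": "none", "torrent": "new", "torrent_details": "last"}
rows = [
    ("group", {"name": "My TV Show Name"}),
    ("torrent", {"title": "S01E01"}),
    ("torrent_details", {"size": "1.2 GB"}),
    ("torrent", {"title": "S01E02"}),
]
items = collect(rows, modes)
```

Here the `torrent_details` row is merged into the item created by the preceding `torrent` row, while the `group` row produces nothing on its own.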

2 participants