
Add PinStatus.created and time-based pagination #39

Merged: 8 commits merged into master on Jul 30, 2020

Conversation

@lidel (Member) commented Jul 27, 2020

This PR aims to address feedback about nondeterministic pagination and the inability to list pins created in a specific time range:

  • adds a PinStatus.created timestamp which indicates when a pin request was registered at the pinning service (queued)
    • note: this is an immutable (write-once) value, it never changes
  • removes the skip parameter from GET /pins
  • adds before and after filters to enable time-based pagination over items in a deterministic way
    • this is also faster than the old offset-based mode, because less data needs to be read to produce a batch

DOCS PREVIEW: https://ipfs.github.io/pinning-services-api-spec/#specUrl=https://raw.githubusercontent.com/ipfs/pinning-services-api-spec/feat/created-timestamp/ipfs-pinning-service.yaml

Closes #38 cc #12 @obo20 @GregTheGreek @priom @jsign @sanderpick @andrewxhill
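The intended client flow under the new filters can be sketched as a small helper. This is an illustration, not part of the spec: `fetch_page` stands in for a real HTTP call to `GET /pins`, and the newest-first ordering of results is an assumption; `limit`, `before`, and `PinStatus.created` are the fields the PR introduces.

```python
from typing import Callable, Iterator

def iter_all_pins(fetch_page: Callable[[dict], list[dict]],
                  limit: int = 10) -> Iterator[dict]:
    """Walk every pin via time-based pagination.

    fetch_page is a stand-in for an HTTP GET /pins call: it takes the query
    parameters and returns a list of PinStatus dicts sorted newest-first.
    """
    params = {"limit": limit}
    while True:
        batch = fetch_page(params)
        if not batch:
            return
        yield from batch
        # Next page: everything created strictly before the oldest item seen.
        params = {"limit": limit, "before": batch[-1]["created"]}
```

Because ISO-8601 timestamps sort lexicographically, the `before` cursor is just the `created` string of the last item in the previous batch.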

This adds a creation timestamp which indicates when a pin request was
registered at the pinning service.

`before` and `after` filters replace `skip` to enable time-based
pagination over items in a deterministic order.
@jacobheun (Collaborator) previously approved these changes Jul 28, 2020 and left a comment:


Some small language suggestions but otherwise this change looks reasonable to me and should improve determinism of results.

(four review suggestions on ipfs-pinning-service.yaml, all resolved)
@obo20 commented Jul 28, 2020

A few things on my mind here.

  • Do we need to default the limit to 1000? Keeping 1000 as a maximum seems fine to me, but I'm struggling to see the use cases for needing 1000 pins at a time by default. I would rather see something like 10-100 as the default, and then applications needing a higher return limit can explicitly pass it in. That way we're not being wasteful by default if the user doesn't actually need 1000 items returned on every call.

  • With this new filtering-by-date paradigm, how do we envision applications handling cases where there are more records than the return limit? If more items exist than were returned, do they just make another request where the date is a millisecond after the oldest pin in their most recently returned set of items?

lidel and others added 4 commits July 28, 2020 17:55
Co-authored-by: Jacob Heun <jacobheun@gmail.com>
@lidel (Member, Author) commented Jul 28, 2020

@obo20

  • I'm OK with making the default lower, though unsure if 100 or 10. A low 10 sounds like a good default for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?
  • The idea is that the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).

@obo20 commented Jul 28, 2020

  • I'm OK with making the default lower, though unsure if 100 or 10. A low 10 sounds like a good default for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?

10 sounds perfect to me.

  • The idea is that the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).

The biggest problem I see here is the new behavior of making the queued timestamp the same as the created timestamp. If we allow people to pass in up to 1000 items as part of the POST /pins (array of CIDs) endpoint, then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.
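The collision described above can be shown concretely with a toy in-memory model (illustrative only, not any vendor's implementation; the `created` field follows the spec's PinStatus):

```python
def page(pins, limit, before=None):
    """Toy model of GET /pins with a `before` cursor over PinStatus.created.

    pins are sorted newest-first; `before` is an exclusive upper bound.
    """
    items = [p for p in pins if before is None or p["created"] < before]
    return items[:limit]

# Three of these four pins were queued by one bulk POST /pins call and
# therefore share the same `created` value.
pins = [
    {"cid": "bafy-d", "created": "2020-07-28T12:00:02Z"},
    {"cid": "bafy-c", "created": "2020-07-28T12:00:01Z"},
    {"cid": "bafy-b", "created": "2020-07-28T12:00:01Z"},  # same as bafy-c
    {"cid": "bafy-a", "created": "2020-07-28T12:00:01Z"},  # same as bafy-c
]

first = page(pins, limit=2)                                # bafy-d, bafy-c
second = page(pins, limit=2, before=first[-1]["created"])  # empty: b and a are silently lost
```

The second request asks for pins created strictly before `12:00:01Z`, so the two pins sharing that timestamp can never be reached.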

@lidel (Member, Author) commented Jul 28, 2020

10 sounds perfect to me.

sgtm, set in 273d474

If we allow people to pass in up to 1000 items as part of the POST /pins (array of CIDs) endpoint, then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

Good catch.
Value-based pagination as proposed in this PR assumes PinStatus.created is unique + immutable + sortable.

We could either:

  • (A) 💛 spec that each pin's creation date needs to be unique
    (leaving it up to the pinning service to ensure each object gets its own timestamp that differs by ms/ns from the others added in the same batch; sadly, we need to trust there won't be any mistakes in implementations)
  • (B) 💚 change the input of POST /pins to accept only one pin at a time
    (removes the surface for introducing two pins with the same creation date)
  • (C) 💔 switch to cursor-based pagination over some other field that is unique + immutable + sortable
    (in practice we don't have a good candidate for this; it means adding something new, or using the CID: &cursor={cid-of-pin-object} plus sorting results lexicographically by CID, which is not the best for performance and UX/DX)

After thinking about it for a moment, I think (B) would be the cleanest and least error-prone way to tackle this. I don't believe any existing vendor supports pinning multiple CIDs in a single call, and I'm unsure how valuable it would be to users anyway: if there is a need for that, one could simply create a DAG that has all of them and pin the root.

@obo20 commented Jul 28, 2020

A and B both seem like feasible options. B certainly seems the easiest to implement, and it also follows the one-CID-operation-at-a-time pattern seen in the modify/delete endpoints.

I think the spec could certainly be expanded to handle multiple requests in the future if we find that it's needed.

@lidel (Member, Author) commented Jul 28, 2020

OK, the rule of thumb for this API is to simplify and remove complexity where possible.
Went with option (B): switched POST /pins to accept a single Pin object.

@lanzafame I tried to find some rationale for accepting multiple objects, but found none on github.
I believe you initially added it in v0.0.1 – was that just future-proofing, or is there an actual need for this?

@lanzafame (Contributor) commented:

It was simply a "the fewer requests, the better" thing, but the v0.0.1 version didn't have the level of metadata associated with creation that now clearly complicates how accepting an array actually plays out.

@jacobheun (Collaborator) commented:

Perhaps we should just keep skip. Adding the created date is helpful, but keeping skip avoids the issues of effectively trying to skip by created date. There is value in being able to fetch results before/after certain dates, but being able to skip as well would simplify these problems.

I'm not opposed to limiting 1 pin per post, but that feels like the wrong solution for this problem.

Is there a good reason to get rid of skip?

@lidel (Member, Author) commented Jul 29, 2020

Is there a good reason to get rid of skip?

The only reason is to simplify the API (replace offset-based pagination with value-based pagination).
If we keep skip, then we ask pinning services to support both pagination types.
In practice, supporting offsets maps nicely to DB table rows and the overhead of supporting both is minimal, so the spec should be OK with keeping skip.

@obo20 any concerns with keeping skip?

@jacobheun (Collaborator) commented:

Only reason is to simplify API (replace offset-based pagination with value-based one).
If we keep skip then we ask pinning services to support both pagination types.

If we want to keep the API surface low, we could just drop the before and after filters for now. The main thing we lose is the ability to filter results by time, which I think has the most value for client-side caching ("I fetched pins yesterday, so only give me ones after that time"), but I think it's a low priority.

@obo20 commented Jul 29, 2020

@lidel the main concerns with skip go back to the initial issue we were having with our pinned records being separate from our "ongoing operations" records (queued, searching, failed). The biggest reason for this is that the majority of our users pin content directly via an HTTP upload (not from the network). As such, it didn't make sense to keep "successful" pins in the same location as pins that had no guarantee of succeeding. I obviously can't speak for other service providers, but I wouldn't be surprised if others had a similar separation in their database records.

We can certainly query both of these tables behind the scenes and then combine/parse the results before returning them to the user. However, there are a few problems with this:

  • Introducing an offset into that situation breaks things, as there's no way of knowing how the skip should be applied to the individual tables we're querying before combining them.
  • Querying both sets and then combining them has the potential to double/triple the amount of "work" service providers need to do to handle these requests.

Skip would be possible if we were able to logically separate "pinned" objects from "ongoing" operations (queued, searching, failed).

For reference, here are some of the options we discussed throughout various threads that would allow for offsets:

A) Only allow querying for one status of pins at a time. (Cleanest, but could introduce extra calls to the API.)

B) Only allow querying for either "pinned" or "ongoing" statuses, but not both. (Fewer API calls, but doesn't feel very clean to me.)

C) Have providers return a server error if they can't query both pinned/ongoing statuses at once. (I would very much like to avoid this.)

D) Return an object with an array for each status queried. Like this:

{ pinned: [], queued: [], searching: [] }

There are a few different ways pagination could work for option D:

  • Providing an offset for each status queried
  • Querying for multiple statuses initially and then allowing users to query individual statuses with offsets from there. (From a UX standpoint, I can't think of a situation where somebody would need to offset more than one status at once.)

I'm personally leaning towards Option A just because of how clean it is.
The extra API calls don't really bother me as a service provider, since any latency should be fairly minimal and requests could easily be parallelized.

@lidel (Member, Author) commented Jul 29, 2020

OK, I did some thinking and realized we should be cognizant of the developer experience here:

  • The API should make it easy to do the right thing, even when the implementer is in a rush or lacks an understanding of how DBs work.
    • What this means for this PR is that pagination should not come with any hidden footguns, such as slowing down when the number of records is high, or skipping over valid records when implemented naively.

AFAIK we solved those footguns by removing the skip filter and by making POST /pins accept only a single Pin.
@jacobheun I don't think we need skip; all pagination needs can be met in a safer manner with before alone.

👉 In other words, I propose we merge this PR as-is.

Below is my rationale.


Why pagination with before is safer and more efficient than skip

The main goal of this PR is to switch from offset-based pagination (skip) to value-based pagination (before).

💔 When skip is used and implemented naively, the backend needs to read the entire dataset and then drop some results. It is non-trivial to implement this safely without a deeper understanding of RDB mechanics, and without compromising performance.

💚 When before is used, the backend reads only from the specified point, using a timestamp column that is unique + sorted + immutable, which makes range requests efficient. This maps well to RDBs; one can leverage indexes, etc.

I know the problems with offset pagination can be solved with some additional work on the backend, but we have a value-based option that works better without doing anything fancy.
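The contrast maps directly onto SQL. Here is a sqlite3 toy (table and column names are illustrative, and timestamps are made unique as this PR requires): both queries return the same page, but the keyset form lets the engine seek via the index instead of producing and discarding all skipped rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (requestid TEXT PRIMARY KEY, created TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_pins_created ON pins (created)")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?)",
    [(f"req-{i:03d}", f"2020-07-01T00:{i // 60:02d}:{i % 60:02d}Z") for i in range(100)],
)

# 💔 offset-based: the engine must produce and discard the first 50 ordered rows
offset_page = conn.execute(
    "SELECT requestid FROM pins ORDER BY created DESC LIMIT 10 OFFSET 50"
).fetchall()

# 💚 value-based: the index seeks straight to the boundary timestamp;
# `before` is the `created` value of the oldest item on the previous page
before = "2020-07-01T00:00:50Z"
keyset_page = conn.execute(
    "SELECT requestid FROM pins WHERE created < ? ORDER BY created DESC LIMIT 10",
    (before,),
).fetchall()
```

With 100 rows the difference is invisible, but the OFFSET query's cost grows with the offset while the keyset query's cost stays proportional to the page size.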

To illustrate, I borrowed some visuals from this blogpost, comparing offset-based pagination (💔 skip) with value-based pagination (💚 before):

Footgun 1: the more records, the longer it takes to skip

In the chart, blue is the time it takes to produce an offset-based response, which grows with the number of records; red is value-based pagination, which grows much more slowly.

Footgun 2: if a pin from a previous batch is removed, we may skip too much

If we paginate via before, we don't have this problem, because each batch always starts at a deterministic record.
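Footgun 2 is easy to reproduce with a toy in-memory list (illustrative data only): delete one pin from an already-fetched batch, and the offset cursor silently drops a record while the before cursor does not.

```python
def offset_page(pins, limit, skip):
    # 💔 offset-based: position depends on how many rows currently precede it
    return pins[skip:skip + limit]

def before_page(pins, limit, before=None):
    # 💚 value-based: position is anchored to an immutable created value
    items = [p for p in pins if before is None or p["created"] < before]
    return items[:limit]

# Six pins, newest first, unique created values (as the PR requires).
pins = [{"created": f"t{i}"} for i in range(6, 0, -1)]   # t6 .. t1

first = offset_page(pins, limit=3, skip=0)               # t6, t5, t4
cursor = first[-1]["created"]                            # "t4"

# One pin from the first batch is deleted between the two requests.
remaining = [p for p in pins if p["created"] != "t5"]

skipped = offset_page(remaining, limit=3, skip=3)        # t2, t1 (t3 was silently lost)
safe = before_page(remaining, limit=3, before=cursor)    # t3, t2, t1 (nothing lost)
```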

@jacobheun (Collaborator) left a comment:


Thanks for the analysis; this approach looks reasonable to me and should be fairly easy to expand in the future as needed.

I added a suggestion about noting the caveat with the created timestamp for value-based pagination, in case bulk creates are ever supported. Not required, but imo it makes it easier to avoid creating problems in the future.

(review suggestion on ipfs-pinning-service.yaml, resolved)
Co-authored-by: Jacob Heun <jacobheun@gmail.com>
@lidel merged commit c6a0972 into master on Jul 30, 2020
@lidel deleted the feat/created-timestamp branch on July 30, 2020 at 12:05
lidel added a commit that referenced this pull request Jul 30, 2020
Fix docs to match spec. The endpoint takes a single object now.
Details: #39
lidel added a commit that referenced this pull request Aug 11, 2020
This removes arrays and replaces them with a single object in places
that were missed earlier when we made the switch to single operations.
Context:
#39 (comment)

Closes #46
@lidel mentioned this pull request Sep 22, 2020
Successfully merging this pull request may close this issue:

Timestamp for "pin creation" event

4 participants