
Add PinStatus.created and time-based pagination #39

Merged: 8 commits merged into master on Jul 30, 2020

Conversation

@lidel (Member) commented Jul 27, 2020

This PR aims to address feedback about nondeterministic pagination and the inability to list pins created in a specific time range:

  • adds a PinStatus.created timestamp which indicates when a pin request was registered at the pinning service (queued)
    • note: this is an immutable (write-once) value, it never changes
  • removes the skip parameter from GET /pins
  • adds before and after filters to enable time-based pagination over items in a deterministic way
    • this is also faster than the old offset-based mode, because less data needs to be read to produce a batch

DOCS PREVIEW: https://ipfs.github.io/pinning-services-api-spec/#specUrl=https://raw.githubusercontent.com/ipfs/pinning-services-api-spec/feat/created-timestamp/ipfs-pinning-service.yaml

Closes #38 cc #12 @obo20 @GregTheGreek @priom @jsign @sanderpick @andrewxhill
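The intended client flow under the new filters can be sketched as a small helper. This is an illustration, not part of the spec: `fetch_page` stands in for a real HTTP call to `GET /pins`, and the newest-first ordering of results is an assumption; `limit`, `before`, and `PinStatus.created` are the fields the PR introduces.

```python
from typing import Callable, Iterator

def iter_all_pins(fetch_page: Callable[[dict], list[dict]],
                  limit: int = 10) -> Iterator[dict]:
    """Walk every pin via time-based pagination.

    fetch_page is a stand-in for an HTTP GET /pins call: it takes the query
    parameters and returns a list of PinStatus dicts sorted newest-first.
    """
    params = {"limit": limit}
    while True:
        batch = fetch_page(params)
        if not batch:
            return
        yield from batch
        # Next page: everything created strictly before the oldest item seen.
        params = {"limit": limit, "before": batch[-1]["created"]}
```

Because ISO-8601 timestamps sort lexicographically, the `before` cursor is just the `created` string of the last item in the previous batch.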

This adds a creation timestamp which indicates when a pin request was
registered at the pinning service.

`before` and `after` filters replace `skip` to enable time-based
pagination over items in a deterministic order.
@jacobheun (Collaborator) previously approved these changes Jul 28, 2020 and left a comment:


Some small language suggestions but otherwise this change looks reasonable to me and should improve determinism of results.

(four review suggestions on ipfs-pinning-service.yaml, all resolved)
@obo20 commented Jul 28, 2020

A few things on my mind here.

  • Do we need to default the limit to 1000? Keeping 1000 as a maximum seems fine to me, but I'm struggling to see the use cases for needing 1000 pins at a time by default. I would rather see something like 10-100 as the default, and then applications needing a higher return limit can explicitly pass it in. That way we're not being wasteful by default if the user doesn't actually need 1000 items returned on every call.

  • With this new filtering-by-date paradigm, how do we envision applications handling cases where there are more records than the return limit? If more items exist than were returned, do they just make another request where the date is a millisecond after the oldest pin in their most recently returned set of items?

lidel and others added 4 commits July 28, 2020 17:55
Co-authored-by: Jacob Heun <jacobheun@gmail.com>
@lidel (Member, Author) commented Jul 28, 2020

@obo20

  • I'm OK with making the default lower, though unsure if 100 or 10. A low 10 sounds like a good default for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?
  • The idea is that the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).

@obo20 commented Jul 28, 2020

  • I'm OK with making the default lower, though unsure if 100 or 10. A low 10 sounds like a good default for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?

10 sounds perfect to me.

  • The idea is that the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).

The biggest problem I see here is the new behavior of making the queued timestamp the same as the created timestamp. If we allow people to pass in up to 1000 items as part of the POST /pins (array of CIDs) endpoint, then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.
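The collision described above can be shown concretely with a toy in-memory model (illustrative only, not any vendor's implementation; the `created` field follows the spec's PinStatus):

```python
def page(pins, limit, before=None):
    """Toy model of GET /pins with a `before` cursor over PinStatus.created.

    pins are sorted newest-first; `before` is an exclusive upper bound.
    """
    items = [p for p in pins if before is None or p["created"] < before]
    return items[:limit]

# Three of these four pins were queued by one bulk POST /pins call and
# therefore share the same `created` value.
pins = [
    {"cid": "bafy-d", "created": "2020-07-28T12:00:02Z"},
    {"cid": "bafy-c", "created": "2020-07-28T12:00:01Z"},
    {"cid": "bafy-b", "created": "2020-07-28T12:00:01Z"},  # same as bafy-c
    {"cid": "bafy-a", "created": "2020-07-28T12:00:01Z"},  # same as bafy-c
]

first = page(pins, limit=2)                                # bafy-d, bafy-c
second = page(pins, limit=2, before=first[-1]["created"])  # empty: b and a are silently lost
```

The second request asks for pins created strictly before `12:00:01Z`, so the two pins sharing that timestamp can never be reached.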

@lidel (Member, Author) commented Jul 28, 2020

10 sounds perfect to me.

sgtm, set in 273d474

If we allow people to pass in up to 1000 items as part of the POST /pins (array of CIDs) endpoint, then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

Good catch.
Value-based pagination as proposed in this PR assumes PinStatus.created is unique + immutable + sortable.

We could either:

  • (A) 💛 spec that each pin's creation date needs to be unique
    (leaving it up to the pinning service to ensure each object gets its own timestamp that differs by ms/ns from the others added in the same batch; sadly, we need to trust there won't be any mistakes in implementations)
  • (B) 💚 change the input of POST /pins to accept only one pin at a time
    (removes the surface for introducing two pins with the same creation date)
  • (C) 💔 switch to cursor-based pagination over some other field that is unique + immutable + sortable
    (in practice we don't have a good candidate for this; it means adding something new, or using the CID: &cursor={cid-of-pin-object} plus sorting results lexicographically by CID, which is not the best for performance and UX/DX)

After thinking about it for a moment, I think (B) would be the cleanest and least error-prone way to tackle this. I don't believe any existing vendor supports pinning multiple CIDs in a single call, and I'm unsure how valuable it would be to users anyway: if there is a need for that, one could simply create a DAG that has all of them and pin the root.

@obo20 commented Jul 28, 2020

A and B both seem like feasible options. B certainly seems the easiest to implement, and it also follows the one-CID-operation-at-a-time pattern seen in the modify/delete endpoints.

I think the spec could certainly be expanded to handle multiple requests in the future if we find that it's needed.

@lidel (Member, Author) commented Jul 28, 2020

OK, the rule of thumb for this API is to simplify and remove complexity where possible.
Went with option (B): switched POST /pins to accept a single Pin object.

@lanzafame I tried to find some rationale for accepting multiple objects, but found none on github.
I believe you initially added it in v0.0.1 – was that just future-proofing, or is there an actual need for this?

@lanzafame (Contributor) commented:

It was simply a "the fewer requests, the better" thing, but the v0.0.1 version didn't have the level of metadata associated with creation that now clearly complicates how accepting an array actually plays out.

@jacobheun (Collaborator) commented:

Perhaps we should just keep skip. Adding the created date is helpful, but keeping skip avoids the issues of effectively trying to skip by created date. There is value in being able to fetch results before/after certain dates, but being able to skip as well would simplify these problems.

I'm not opposed to limiting 1 pin per post, but that feels like the wrong solution for this problem.

Is there a good reason to get rid of skip?

@lidel (Member, Author) commented Jul 29, 2020

Is there a good reason to get rid of skip?

The only reason is to simplify the API (replace offset-based pagination with value-based pagination).
If we keep skip, then we ask pinning services to support both pagination types.
In practice, supporting offsets maps nicely to DB table rows and the overhead of supporting both is minimal, so the spec should be OK with keeping skip.

@obo20 any concerns with keeping skip?

@jacobheun (Collaborator) commented:

Only reason is to simplify API (replace offset-based pagination with value-based one).
If we keep skip then we ask pinning services to support both pagination types.

If we want to keep the API surface low, we could just drop the before and after filters for now. The main thing we lose is the ability to filter results by time, which I think has the most value for client-side caching ("I fetched pins yesterday, so only give me ones after that time"), but I think it's a low priority.

@obo20 commented Jul 29, 2020

@lidel the main concerns with skip go back to the initial issue we were having with our pinned records being separate from our "ongoing operations" records (queued, searching, failed). The biggest reason for this is that the majority of our users pin content directly via an HTTP upload (not from the network). As such, it didn't make sense to keep "successful" pins in the same location as pins that had no guarantee of succeeding. I obviously can't speak for other service providers, but I wouldn't be surprised if others had a similar separation in their database records.

We can certainly query both of these tables behind the scenes and then combine/parse the results before returning them to the user. However, there are a few problems with this:

  • Introducing an offset into that situation breaks things, as there's no way of knowing how the skip should be applied to the individual tables we're querying before combining them.
  • Querying both sets and then combining them has the potential to double/triple the amount of "work" service providers need to do to handle these requests.

Skip would be possible if we were able to logically separate "pinned" objects from "ongoing" operations (queued, searching, failed).

For reference, here are some of the options we discussed throughout various threads that would allow for offsets:

A) Only allow querying for one status of pins at a time. (Cleanest, but could introduce extra calls to the API.)

B) Only allow querying for either "pinned" or "ongoing" statuses, but not both. (Fewer API calls, but doesn't feel very clean to me.)

C) Have providers return a server error if they can't query both pinned/ongoing statuses at once. (I would very much like to avoid this.)

D) Return an object with an array for each status queried. Like this:

{ pinned: [], queued: [], searching: [] }

There are a few different ways pagination could work for option D:

  • Providing an offset for each status queried
  • Querying for multiple statuses initially and then allowing users to query individual statuses with offsets from there. (From a UX standpoint, I can't think of a situation where somebody would need to offset more than one status at once.)

I'm personally leaning towards Option A just because of how clean it is.
The extra API calls don't really bother me as a service provider, since any latency should be fairly minimal and requests could easily be parallelized.

@lidel (Member, Author) commented Jul 29, 2020

OK, I did some thinking and realized we should be cognizant of the developer experience here:

  • The API should make it easy to do the right thing, even when the implementer is in a rush or lacks an understanding of how DBs work.
    • What this means for this PR is that pagination should not come with any hidden footguns, such as slowing down when the number of records is high, or skipping over valid records when implemented naively.

AFAIK we solved those footguns by removing the skip filter and by making POST /pins accept only a single Pin.
@jacobheun I don't think we need skip; all pagination needs can be met in a safer manner with before alone.

👉 In other words, I propose we merge this PR as-is.

Below is my rationale.


Why pagination with before is safer and more efficient than skip

The main goal of this PR is to switch from offset-based pagination (skip) to value-based pagination (before).

💔 When skip is used and implemented naively, the backend needs to read the entire dataset and then drop some results. It is non-trivial to implement this safely without a deeper understanding of RDB mechanics, and without compromising performance.

💚 When before is used, the backend reads only from the specified point, using a timestamp column that is unique + sorted + immutable, which makes range requests efficient. This maps well to RDBs; one can leverage indexes, etc.

I know the problems with offset pagination can be solved with some additional work on the backend, but we have a value-based option that works better without doing anything fancy.
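The contrast maps directly onto SQL. Here is a sqlite3 toy (table and column names are illustrative, and timestamps are made unique as this PR requires): both queries return the same page, but the keyset form lets the engine seek via the index instead of producing and discarding all skipped rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (requestid TEXT PRIMARY KEY, created TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_pins_created ON pins (created)")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?)",
    [(f"req-{i:03d}", f"2020-07-01T00:{i // 60:02d}:{i % 60:02d}Z") for i in range(100)],
)

# 💔 offset-based: the engine must produce and discard the first 50 ordered rows
offset_page = conn.execute(
    "SELECT requestid FROM pins ORDER BY created DESC LIMIT 10 OFFSET 50"
).fetchall()

# 💚 value-based: the index seeks straight to the boundary timestamp;
# `before` is the `created` value of the oldest item on the previous page
before = "2020-07-01T00:00:50Z"
keyset_page = conn.execute(
    "SELECT requestid FROM pins WHERE created < ? ORDER BY created DESC LIMIT 10",
    (before,),
).fetchall()
```

With 100 rows the difference is invisible, but the OFFSET query's cost grows with the offset while the keyset query's cost stays proportional to the page size.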

To illustrate, I borrowed some visuals from this blogpost, comparing offset-based pagination (💔 skip) with value-based pagination (💚 before):

Footgun 1: the more records, the longer it takes to skip

In the chart, blue is the time it takes to produce an offset-based response, which grows with the number of records; red is value-based pagination, which grows much more slowly.

Footgun 2: if a pin from a previous batch is removed, we may skip too much

If we paginate via before, we don't have this problem, because each batch always starts at a deterministic record.
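Footgun 2 is easy to reproduce with a toy in-memory list (illustrative data only): delete one pin from an already-fetched batch, and the offset cursor silently drops a record while the before cursor does not.

```python
def offset_page(pins, limit, skip):
    # 💔 offset-based: position depends on how many rows currently precede it
    return pins[skip:skip + limit]

def before_page(pins, limit, before=None):
    # 💚 value-based: position is anchored to an immutable created value
    items = [p for p in pins if before is None or p["created"] < before]
    return items[:limit]

# Six pins, newest first, unique created values (as the PR requires).
pins = [{"created": f"t{i}"} for i in range(6, 0, -1)]   # t6 .. t1

first = offset_page(pins, limit=3, skip=0)               # t6, t5, t4
cursor = first[-1]["created"]                            # "t4"

# One pin from the first batch is deleted between the two requests.
remaining = [p for p in pins if p["created"] != "t5"]

skipped = offset_page(remaining, limit=3, skip=3)        # t2, t1 (t3 was silently lost)
safe = before_page(remaining, limit=3, before=cursor)    # t3, t2, t1 (nothing lost)
```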

@jacobheun (Collaborator) left a comment:


Thanks for the analysis; this approach looks reasonable to me and should be fairly easy to expand in the future as needed.

I added a suggestion about noting the caveat with the created timestamp for value-based pagination, in case bulk creates are ever supported. Not required, but imo it makes it easier to avoid creating problems in the future.

(review suggestion on ipfs-pinning-service.yaml, resolved)
Co-authored-by: Jacob Heun <jacobheun@gmail.com>
@lidel merged commit c6a0972 into master on Jul 30, 2020
@lidel deleted the feat/created-timestamp branch on July 30, 2020 at 12:05
lidel added a commit that referenced this pull request Jul 30, 2020
Fix docs to match spec. The endpoint takes a single object now.
Details: #39
lidel added a commit that referenced this pull request Aug 11, 2020
This removes arrays and replaces them with a single object in places
that were missed earlier when we made the switch to single operations.
Context:
#39 (comment)

Closes #46
@lidel mentioned this pull request Sep 22, 2020
Successfully merging this pull request may close this issue:

Timestamp for "pin creation" event

4 participants