This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

feat: store pins in datastore instead of a DAG #2771

Merged: achingbrain merged 47 commits into master from refactor/store-pins-in-datastore on Aug 25, 2020


achingbrain
Member

@achingbrain achingbrain commented Feb 12, 2020

Adds a .pins datastore to ipfs-repo and uses that to store pins as cbor binary keyed by base32 encoded multihashes (n.b. not CIDs).

Format

As stored in the datastore, each pin has several fields:

{
  codec: // optional Number, the codec from the CID that this multihash was pinned with, if omitted, treated as 'dag-pb'
  version: // optional Number, the version number from the CID that this multihash was pinned with, if omitted, treated as v0
  depth: // Number Infinity = recursive pin, 0 = direct, 1+ = pinned to a depth
  name: // optional String a user-friendly name for the pin
  metadata: // optional Object, user-defined data for the pin
}

Notes:

.codec and .version are stored so we can recreate the original CID when listing pins.
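
A minimal sketch of how a stored record might be normalised when listing pins, assuming the defaults described above. `describePin` and the `depth-N` label are hypothetical names used only for illustration; `0x70` is the multicodec code for dag-pb.

```javascript
// Multicodec code for dag-pb, used when a stored pin omits `codec`
const DAG_PB = 0x70

// Hypothetical helper: fill in the documented defaults for a stored pin
function describePin (pin) {
  const codec = pin.codec === undefined ? DAG_PB : pin.codec
  const version = pin.version === undefined ? 0 : pin.version

  let type
  if (pin.depth === Infinity) {
    type = 'recursive'
  } else if (pin.depth === 0) {
    type = 'direct'
  } else {
    type = `depth-${pin.depth}` // assumed label for depth-limited pins
  }

  return { codec, version, type }
}

console.log(describePin({ depth: Infinity }))
console.log(describePin({ codec: 0x71, version: 1, depth: 0 }))
```

Keeping the defaults in one place means the smallest possible record (just `depth`) still round-trips to a CIDv0 dag-pb pin.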

Metadata

The intention is for us to be able to add extra fields that have technical meaning to the root of the object, and the user can store application-specific data in the metadata field.

CLI

$ ipfs pin add bafyfoo --metadata key1=value1,key2=value2
$ ipfs pin add bafyfoo --metadata-format=json --metadata '{"key1":"value1","key2":"value2"}'

$ ipfs pin list
bafyfoo

$ ipfs pin list -l
CID      Name    Type       Metadata
bafyfoo  My pin  Recursive  {"key1":"value1","key2":"value2"}

$ ipfs pin metadata Qmfoo --format=json
{"key1":"value1","key2":"value2"}
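
A rough sketch of how the two `--metadata` input formats above could be parsed into the same object. `parseMetadata` is a hypothetical helper, not part of the CLI, and treating every key=value value as a string is an assumption.

```javascript
// Hypothetical parser for the proposed --metadata flag.
// format 'json' mirrors --metadata-format=json; anything else is key=value pairs.
function parseMetadata (input, format = 'keyvalue') {
  if (format === 'json') {
    return JSON.parse(input)
  }

  const metadata = {}
  for (const pair of input.split(',')) {
    const index = pair.indexOf('=')
    if (index < 1) {
      throw new Error(`Invalid metadata pair: ${pair}`)
    }
    // split on the first '=' only, so values may themselves contain '='
    metadata[pair.slice(0, index)] = pair.slice(index + 1)
  }
  return metadata
}

console.log(parseMetadata('key1=value1,key2=value2'))
console.log(parseMetadata('{"key1":"value1","key2":"value2"}', 'json'))
```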

HTTP API

  • '/api/v0/pin/add' route adds new metadata argument, accepts a json string
  • '/api/v0/pin/metadata' returns metadata as json
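
As a sketch, a request URL for the first route could be built like this. The `metadata` argument is the one proposed in this PR; the base URL is an assumed local API address and `pinAddUrl` is a hypothetical helper.

```javascript
// Build the URL for the proposed pin/add call. `arg` carries the CID as with
// other HTTP API routes; the host and port are assumed defaults for a local node.
function pinAddUrl (cid, metadata, base = 'http://127.0.0.1:5001') {
  const url = new URL('/api/v0/pin/add', base)
  url.searchParams.set('arg', cid)
  // the route accepts metadata as a JSON string
  url.searchParams.set('metadata', JSON.stringify(metadata))
  return url
}

console.log(pinAddUrl('bafyfoo', { key1: 'value1', key2: 'value2' }).href)
```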

Future tech:

  • Pin namespaces? E.g. the datastore key would be /default/C19A797..., /my-namespace/C19A797...
    • ipfs pin ls --namespace=my-namespace
  • Query interface
    • ipfs pin query metadata.key1=value1
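
To make the query idea concrete, here is a sketch of matching a `metadata.key1=value1` style query against pin records. Nothing here exists yet: `matchesQuery` is hypothetical, and comparing values as strings is an assumption.

```javascript
// Hypothetical matcher for queries such as 'metadata.key1=value1'
function matchesQuery (pin, query) {
  const index = query.indexOf('=')
  if (index < 1) {
    throw new Error(`Invalid query: ${query}`)
  }
  const path = query.slice(0, index).split('.')
  const expected = query.slice(index + 1)

  // walk the dotted path into the pin record
  let value = pin
  for (const segment of path) {
    if (value === null || typeof value !== 'object') {
      return false
    }
    value = value[segment]
  }
  return value !== undefined && String(value) === expected
}

const pins = [
  { depth: Infinity, metadata: { key1: 'value1' } },
  { depth: 0, metadata: { key1: 'other' } },
  { depth: 0 }
]
console.log(pins.filter(pin => matchesQuery(pin, 'metadata.key1=value1')))
```

A linear scan like this would touch every pin, so a real query interface would probably want an index.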

Core API

  • ipfs.pin.addAll accepts and returns an async iterator
  • ipfs.pin.rmAll accepts and returns an async iterator
// pass a cid or IPFS Path with options
const { cid } = await ipfs.pin.add(new CID('/ipfs/Qmfoo'), {
  recursive: false,
  metadata: {
    key: 'value'
  },
  timeout: 2000
})

// pass an iterable of CIDs
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  new CID('/ipfs/Qmfoo'),
  new CID('/ipfs/Qmbar')
], { timeout: '2s' }))

// pass an iterable of objects with options
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  { cid: new CID('/ipfs/Qmfoo'), recursive: true, comments: 'A recursive pin' },
  { cid: new CID('/ipfs/Qmbar'), recursive: false, comments: 'A direct pin' }
], { timeout: '2s' }))
  • ipfs.pin.rmAll accepts and returns an async generator (other input types are available)
// pass an IPFS Path or CID
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo/file.txt'))

// pass options
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo'), { recursive: true })

// pass an iterable of CIDs or objects with options
const [{ cid }] = await all(ipfs.pin.rmAll([{ cid: new CID('/ipfs/Qmfoo'), recursive: true }]))

Bonus: Lets us pipe the output of one command into another:

await pipe(
  ipfs.pin.ls({ type: 'recursive' }),
  (source) => ipfs.pin.rmAll(source)
)

// or
await all(ipfs.pin.rmAll(ipfs.pin.ls({ type: 'recursive' })))
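
To show why this composes, here is a dependency-free sketch with stand-in async generators. `ls`, `rmAll` and `all` below are mocks for illustration, not the real `ipfs.pin` methods or the `it-all` module.

```javascript
// Stand-in for ipfs.pin.ls: yields pins one at a time
async function * ls () {
  yield { cid: 'Qmfoo', type: 'recursive' }
  yield { cid: 'Qmbar', type: 'recursive' }
}

// Stand-in for ipfs.pin.rmAll: consumes a source lazily, yields what it removed
async function * rmAll (source) {
  for await (const { cid } of source) {
    // a real implementation would delete the pin from the datastore here
    yield { cid }
  }
}

// Collect an async iterable into an array, in the spirit of it-all
async function all (source) {
  const results = []
  for await (const item of source) {
    results.push(item)
  }
  return results
}

async function main () {
  const removed = await all(rmAll(ls()))
  console.log(removed.map(({ cid }) => cid))
}

main().catch(console.error)
```

Because each stage pulls from the previous one on demand, the full pin list never has to be held in memory at once.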

Todo:

  • dedupe interface pinning tests
  • ipfs-repo migration script

Depends on:

BREAKING CHANGES:

  • pins are now stored in a datastore, a repo migration will be necessary

@achingbrain
Member Author

Refs:

#2650
#2197

Supersedes:

#2198

@achingbrain
Member Author

achingbrain commented Feb 12, 2020

Ad hoc testing script. Add a buffer without pinning it, then time how long it takes to pin it. Store the time and work out the average time taken over every 1,000 pins:

'use strict'

const last = require('it-last')
const drain = require('it-drain')

const { createController } = require('ipfsd-ctl')

async function main () {
  const ipfs = (await createController({
    type: 'go',
    ipfsBin: require('go-ipfs-dep').path(),
    ipfsHttpModule: require('ipfs-http-client'),
    disposable: false
  }))
  await ipfs.init()
  await ipfs.start()

  let times = []
  let chunk = 0

  for (let i = 0; i < 83000; i++) {
    const buf = Buffer.from(`${Math.random()}`)

    const result = await last(ipfs.api.add(buf, {
      pin: false
    }))

    const start = Date.now()

    const res = await ipfs.api.pin.add(result.cid)

    if (res[Symbol.asyncIterator]) {
      await drain(res)
    }

    const mem = process.memoryUsage()

    times.push({
      ...mem,
      elapsed: Date.now() - start
    })

    chunk++

    if (chunk === 1000) {
      const sum = times.reduce((acc, curr) => {
        acc.elapsed += curr.elapsed
        acc.rss += curr.rss
        acc.heapTotal += curr.heapTotal
        acc.heapUsed += curr.heapUsed
        acc.external += curr.external

        return acc
      }, { elapsed: 0, rss: 0, heapTotal: 0, heapUsed: 0, external: 0 })

      console.info(`${i + 1}, ${sum.elapsed / times.length}, ${sum.rss / times.length}, ${sum.heapTotal / times.length}, ${sum.heapUsed / times.length}, ${sum.external / times.length}`)

      chunk = 0
      times = []
    }
  }

  await ipfs.stop()
}

main().catch(err => {
  console.error(err)
  process.exit(1)
})

Results:

10k pins, DAG vs datastore: the speedup in time taken to add a single pin ranges from 20x to 300x:

image

After 100k pins, there doesn't seem to be much performance degradation when storing in the datastore, whereas the DAG method degrades significantly after 8192 pins (see #2197 for discussion of that):

image

The next significant performance jump vs DAGs would probably be after the first layer of buckets is full - e.g. 256 buckets of 8192 pins = 2,097,152 pins. That'll probably take a bit of time to benchmark...

@achingbrain
Member Author

achingbrain commented Feb 12, 2020

Next steps:

  • Ensure all tests are passing
  • Add concurrency to increase perf when fetching indirectly pinned CIDs for very large DAGs
  • Write repo migration script
  • Add docs and tests for named pins
  • Make resolvePath util async iterable

@alanshaw
Member

That's a very cool speed improvement! Some observations:

  • Double storing of CID
  • Repo is no longer compatible with go-ipfs repo
  • We lose the ability to share pinsets via IPFS

@achingbrain
Member Author

Double storing of CID

I guess you could only store the cid version/codec in the pin? I was thinking of changing the pin type to be an integer too, so there are definitely some improvements that can be made, this is just a first pass.

Repo is no longer compatible with go-ipfs repo

@Stebalien has talked about making a similar change to this too so it's only slightly ahead of the go-ipfs repo.

At any rate, go-ipfs is switching to badger by default which js-ipfs can't read so I'm not sure how much of a priority that is any more.

We lose the ability to share pinsets via IPFS

I guess you can't share your entire list of pins by sharing one CID, but now you also don't have to share your entire list of pins; you can share individual ones.

Grouping multiple pins as pinsets could be added back in as a new feature; the human-readable names would make this nicer to work with.

Something like:

$ ipfs pin add Qmfoo
pinned Qmfoo recursively
$ ipfs pin-set add my-super-fun-pinset Qmfoo
$ ipfs pin-set list my-super-fun-pinset
my-super-fun-pinset Qmqux
  Qmfoo
  Qmbar
  Qmbaz

You could even have the root of a pinset be an IPNS name to allow pulling updates from the network. That'd be neat.

@Stebalien
Member

Repo is no longer compatible with go-ipfs repo

This can be fixed :).

We lose the ability to share pinsets via IPFS

At the moment, I think this is causing strictly more harm than help. It's been 6 years and I have yet to see someone use this.

Ideally, everything would be stored in an IPLD-backed graph database. However, we aren't there yet in terms of tooling.

We could get part way there by creating an IPLD-backed datastore (datastore -> IPLD HAMT -> datastore) but that will throw away the type information.

Double storing of CID

Any reason to store the CID?

keyed by b58

Base64url? Go-ipfs, at least, now has hyper-optimized base58. However, it's still slower than base64 (and takes more space).


Questions/comments.

  • How much size does this take? The definitions of these pins is now significantly larger so we should account for the overhead.
  • Let's make sure to allow for arbitrary fields.
  • Are the names unique?
  • Given that we can't efficiently query by name, we might want to consider just calling them comments.
  • We might want to consider mapping names to pins, instead of CIDs to pins. See Voker's work: Named pins & pins stored in datastore kubo#4757. The only concern here is that it significantly changes the API, whereas the current version just optimizes things a bit.

@achingbrain achingbrain force-pushed the refactor/store-pins-in-datastore branch from 43316eb to 80986f9 Compare February 13, 2020 11:56
@achingbrain
Member Author

Some more graphs. I pinned 83k single blocks using the test script above (originally intended to be 100k but the js-dag benchmark took too long to run and I had to get on an aeroplane).

image

The initial hump at 8192 pins is there, then a consistent performance degradation over time. At 83k pins, js is taking 2.5s to add a pin. Go has the same degradation but it is significantly less pronounced.

image

The js-dag implementation stores the pinsets in memory; js-datastore does not. There is an increase in memory usage over time, but it may not be hitting the v8 gc threshold, or there's a leak somewhere...

image

How much size does this take? The definitions of these pins is now significantly larger so we should account for the overhead.

The sizes appear to be comparable, or perhaps they are statistically insignificant compared to the block size.

After completing the benchmark and running repo gc I see:

# js-dag
.jsipfs $  du -hs
367M	.

# go-dag
.ipfs $ du -hs
353M	.

# js-datastore
.jsipfs $ du -hs
344M	.

Let's make sure to allow for arbitrary fields.

Yes, this is the idea behind storing them CBOR encoded rather than protobufs.

Are the names unique?
Given that we can't efficiently query by name, we might want to consider just calling them comments.

Good suggestion, names are not unique so comments might be a better field name.

We might want to consider mapping names to pins, instead of CIDs to pins

If we're not going to let the user query by name we probably shouldn't do this.

Any reason to store the CID?

My thinking was that by using the multihash of a block as the pin identifier (not the full CID), it becomes cheap to calculate if a given block has already been pinned (assuming the user has hashed it with the same algorithm).

The full CID is stored so we can show the user what they used to pin the block when they do an ipfs.pin.ls().
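
A sketch of that lookup, assuming sha2-256 multihashes. The `Map` stands in for the `.pins` datastore, hex is a stand-in key encoding chosen only to keep the example dependency-free, and `multihash`, `pinKey` and `isPinned` are hypothetical helpers.

```javascript
const crypto = require('crypto')

// Build a sha2-256 multihash: <0x12 (sha2-256)> <0x20 (32 byte digest)> <digest>
function multihash (block) {
  const digest = crypto.createHash('sha256').update(block).digest()
  return Buffer.concat([Buffer.from([0x12, 0x20]), digest])
}

// Hex is an assumed stand-in for the datastore key encoding
function pinKey (mh) {
  return `/${mh.toString('hex').toUpperCase()}`
}

const pins = new Map() // stand-in for the .pins datastore

const block = Buffer.from('hello world')
// Pinned via a CIDv1 with the raw codec (0x55)
pins.set(pinKey(multihash(block)), { codec: 0x55, version: 1, depth: Infinity })

// Any CID wrapping this multihash, whatever its version or codec, maps to the
// same key, so the check is a single lookup
function isPinned (blockBytes) {
  return pins.has(pinKey(multihash(blockBytes)))
}

console.log(isPinned(block))
console.log(isPinned(Buffer.from('something else')))
```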

@achingbrain
Member Author

cc @hsanjuan

Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor
binary keyed by base64 stringified multihashes (n.b. not CIDs).

Each pin has several fields:

```javascript
{
  cid: // buffer, the full CID pinned
  type: // string, 'recursive' or 'direct'
  comments: // string, human-readable comments for the pin
}
```

BREAKING CHANGES:

* pins are now stored in a datastore, a repo migration will be necessary
* ipfs.pins.add now returns an async generator
* ipfs.pins.rm now returns an async generator

Depends on:

- [ ] ipfs/js-ipfs-repo#221
@achingbrain achingbrain force-pushed the refactor/store-pins-in-datastore branch from a5ec30e to 582be49 Compare March 4, 2020 08:30
@achingbrain achingbrain changed the title feat: store pins in datastore instead of DAG feat: store pins in datastore instead of a DAG Mar 4, 2020
achingbrain added a commit to ipfs/interop that referenced this pull request Mar 5, 2020
The changes in ipfs/js-ipfs#2771 mean that
the input/output of `ipfs.pins.add` and `ipfs.pins.rm` are now
streaming so this PR updates to the new API.
@achingbrain achingbrain marked this pull request as ready for review March 6, 2020 17:40
@achingbrain achingbrain merged commit 64b7fe4 into master Aug 25, 2020
@achingbrain achingbrain deleted the refactor/store-pins-in-datastore branch August 25, 2020 06:20
SgtPooki referenced this pull request in ipfs/js-kubo-rpc-client Aug 18, 2022
Adds a `.pins` datastore to `ipfs-repo` and uses that to store pins as cbor binary keyed by multihash.

### Format

As stored in the datastore, each pin has several fields:

```javascript
{
  codec: // optional Number, the codec from the CID that this multihash was pinned with, if omitted, treated as 'dag-pb'
  version: // optional Number, the version number from the CID that this multihash was pinned with, if omitted, treated as v0
  depth: // Number Infinity = recursive pin, 0 = direct, 1+ = pinned to a depth
  comments: // optional String user-friendly description of the pin
  metadata: // optional Object, user-defined data for the pin
}
```

Notes:

`.codec` and `.version` are stored so we can recreate the original CID when listing pins.

### Metadata

The intention is for us to be able to add extra fields that have technical meaning to the root of the object, and the user can store application-specific data in the `metadata` field.

### CLI

```console
$ ipfs pin add bafyfoo --metadata key1=value1,key2=value2
$ ipfs pin add bafyfoo --metadata-format=json --metadata '{"key1":"value1","key2":"value2"}'

$ ipfs pin list
bafyfoo

$ ipfs pin list -l
CID      Name    Type       Metadata
bafyfoo  My pin  Recursive  {"key1":"value1","key2":"value2"}

$ ipfs pin metadata Qmfoo --format=json
{"key1":"value1","key2":"value2"}
```

### HTTP API

* '/api/v0/pin/add' route adds new `metadata` argument, accepts a json string
* '/api/v0/pin/metadata' returns metadata as json

### Core API

* `ipfs.pin.addAll` accepts and returns an async iterator
* `ipfs.pin.rmAll` accepts and returns an async iterator

```javascript
// pass a cid or IPFS Path with options
const { cid } = await ipfs.pin.add(new CID('/ipfs/Qmfoo'), {
  recursive: false,
  metadata: {
    key: 'value'
  },
  timeout: 2000
})

// pass an iterable of CIDs
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  new CID('/ipfs/Qmfoo'),
  new CID('/ipfs/Qmbar')
], { timeout: '2s' }))

// pass an iterable of objects with options
const [{ cid: cid1 }, { cid: cid2 }] = await all(ipfs.pin.addAll([
  { cid: new CID('/ipfs/Qmfoo'), recursive: true, comments: 'A recursive pin' },
  { cid: new CID('/ipfs/Qmbar'), recursive: false, comments: 'A direct pin' }
], { timeout: '2s' }))
```

* ipfs.pin.rmAll accepts and returns an async generator (other input types are available)

```javascript
// pass an IPFS Path or CID
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo/file.txt'))

// pass options
const { cid } = await ipfs.pin.rm(new CID('/ipfs/Qmfoo'), { recursive: true })

// pass an iterable of CIDs or objects with options
const [{ cid }] = await all(ipfs.pin.rmAll([{ cid: new CID('/ipfs/Qmfoo'), recursive: true }]))
```

Bonus: Lets us pipe the output of one command into another:

```javascript
await pipe(
  ipfs.pin.ls({ type: 'recursive' }),
  (source) => ipfs.pin.rmAll(source)
)

// or
await all(ipfs.pin.rmAll(ipfs.pin.ls({ type: 'recursive' })))
```

BREAKING CHANGES:

* pins are now stored in a datastore, a repo migration will occur on startup
* All deps of this module now use Uint8Arrays in place of node Buffers

7 participants