This repository has been archived by the owner on Feb 12, 2024. It is now read-only.

Embracing web native FormData / File where possible #3029

Closed
Gozala opened this issue May 11, 2020 · 14 comments
Assignees
Labels
env:browser exp/wizard Extensive knowledge (implications, ramifications) required kind/maintenance Work required to avoid breaking changes or harm to project's status quo P1 High: Likely tackled by core team if no one steps up topic/perf Performance

Comments

@Gozala
Contributor

Gozala commented May 11, 2020

As discussed in #2838 (comment), mainstream browsers today do not yet support passing a ReadableStream as a request body. To work around this limitation, JS-IPFS currently encodes the stream of content (on ipfs.add / ipfs.files.write) into multipart/form-data by buffering it:

const res = await api.post('add', {
  searchParams: toUrlSearchParams({
    'stream-channels': true,
    ...options,
    progress: Boolean(progressFn)
  }),
  timeout: options.timeout,
  signal: options.signal,
  ...(
    await multipartRequest(input, options.headers)
  )
})

return {
  headers: merge(headers, {
    'Content-Type': `multipart/form-data; boundary=${boundary}`
  }),
  body: await toStream(streamFiles(source))
}

async function * bufferise (source) {
  for await (const chunk of source) {
    if (Buffer.isBuffer(chunk)) {
      yield chunk
    } else {
      yield Buffer.from(chunk)
    }
  }
}

return toBuffer(bufferise(it))

This is problematic for webui, for instance, as some files may not fit into memory or may exceed the browser quota.


Modern browsers do have native FormData and File / Blob primitives that, if used carefully, can facilitate streaming. It appears that in the past browsers loaded all of the FormData / File / Blob content during fetch despite not reading it. However, that has evidently been fixed, and modern browsers (tested with Firefox, Chrome, Safari) no longer exhibit that problem.

The following example was used to verify this behavior:
https://ipfs.io/ipfs/QmWVrTAeA1FqRSK3owpCfvB69wsz2Dbr2THK6ehn5uEKbp

On my Mac I do not observe a visible increase in memory when uploading a 5 GB file (across the mentioned browsers). I also observe streaming behavior, as the file size shown in WebUI changes during the upload:

Screen Shot 2020-05-08 at 3 33 55 PM

Screen Shot 2020-05-08 at 3 36 54 PM


This investigation suggests that there is an opportunity to avoid buffering overhead in browsers. Doing it for ipfs.files.write should be fairly straightforward, since it always writes a single file.

For ipfs.add, things appear more complicated, as headers need to be added to each multipart part (e.g. unix mode & mtime). Some investigation is required to see how feasible that is with native FormData. There are also some concerns about supporting directory structures.

@Gozala Gozala added the need/triage Needs initial labeling and prioritization label May 11, 2020
@Gozala
Contributor Author

Gozala commented May 11, 2020

There are also some concerns about supporting directory structures.

As per HTTP API docs

Directory parts have a special content type application/x-directory. These parts do not carry any data. The part headers look as follows:

Content-Disposition: form-data; name="file"; filename="folderName"
Content-Type: application/x-directory

I suppose that is the problem that can't be immediately addressed via browser APIs. However, I wonder if it is worth accommodating this browser limitation in the HTTP API so that such metadata could be encoded differently. Given that new FormData().append(name, content, fileName) provides a way to specify name and filename metadata, I see two options:

  1. filename may contain the path delimiter /. Is that not a good enough clue that it is a directory entry?
  2. Maybe instead of name="file" something else, e.g. name="directory", could be used?

@Gozala
Contributor Author

Gozala commented May 11, 2020

Did some investigation, which leads me to believe that directory structures are not a problem. The following code

let data = new FormData()
let dir = new File([], 'folderName', {
  lastModified: new Date('1995-12-17T03:24:00'),
  type: 'x-directory'
})
let file = new File(['could uploded file'], 'folderName/demo.txt', {
  lastModified: new Date('1995-12-18T03:24:00'),
  type: 'plain/text'
})
data.append('file', dir)
data.append('file', file)


fetch('/post', {
  method: "POST",
  body: data
})

Produces the following request body:

-----------------------------1783252016325641052066437817
Content-Disposition: form-data; name="file"; filename="folderName"
Content-Type: x-directory


-----------------------------1783252016325641052066437817
Content-Disposition: form-data; name="file"; filename="folderName:demo.txt"
Content-Type: plain/text

could uploded file
-----------------------------1783252016325641052066437817--

However, it also turns out that lastModified metadata is ignored.

@Gozala
Contributor Author

Gozala commented May 11, 2020

Now, as far as I can tell, there are three other headers (mode, mtime, mtime-nsecs) currently in use (although not documented in the HTTP API docs):

if (mode !== null && mode !== undefined) {
  yield `mode: ${modeToString(mode)}\r\n`
}
if (mtime != null) {
  const {
    secs, nsecs
  } = mtimeToObject(mtime)
  yield `mtime: ${secs}\r\n`
  if (nsecs != null) {
    yield `mtime-nsecs: ${nsecs}\r\n`
  }
}

I wonder if we could allow providing metadata as a prefix of the content, e.g. the following code

let data = new FormData()
let dir = new File([], 'folderName', {
  lastModified: new Date('1995-12-17T03:24:00'),
  type: 'x-directory'
})

let rawFile = new File(['could uploded file'], 'demo.txt')
let stat = [
  `mode: 0666`,
  `mtime: ${new Date('1995-12-18T03:24:00').getTime()}`,
  `\r\n`
]

let fileWithMeta = new Blob([stat.join('\r\n'), rawFile], {
  type: 'x-file'
})
data.append('file', dir)
data.append('file', fileWithMeta, 'folderName/demo.txt')


fetch('/post', {
  method: "POST",
  body: data
})

Produces the following request body:

-----------------------------72860279640036950501085036757
Content-Disposition: form-data; name="file"; filename="folderName"
Content-Type: x-directory


-----------------------------72860279640036950501085036757
Content-Disposition: form-data; name="file"; filename="folderName/demo.txt"
Content-Type: x-file

mode: 0666
mtime: 819285840000

could uploded file
-----------------------------72860279640036950501085036757--

This also would not require a breaking change to the HTTP API, but rather a backwards-compatible extension that recognizes the x-file content type as an encoding of headers + content.
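On the server side, splitting such an x-file part back into metadata and content would be a matter of cutting at the first blank line. A hypothetical sketch (parseXFilePart is not an existing API, and it assumes ASCII headers so byte and string offsets coincide):

```javascript
// Hypothetical decoder for an 'x-file' part: header lines, a blank
// line, then the raw file content.
function parseXFilePart (bytes) {
  // ASCII-only sketch: byte offsets and string offsets coincide here
  const text = new TextDecoder().decode(bytes)
  const sep = text.indexOf('\r\n\r\n')
  const headers = {}
  for (const line of text.slice(0, sep).split('\r\n')) {
    const i = line.indexOf(':')
    headers[line.slice(0, i)] = line.slice(i + 1).trim()
  }
  // content starts right after the CRLF CRLF separator
  return { headers, content: bytes.subarray(sep + 4) }
}
```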

@Gozala
Contributor Author

Gozala commented May 11, 2020

Alternatively, we could implement our own FormData encoder that uses the Blob API and avoids reading the underlying data, e.g.:

class FormDataWriter {
  constructor() {
    /** @type {Blob[]} */
    this.parts = []
  }
  append(name, value, filename, headers = {}) {
    const partHeaders = []
    const partFilename =
      filename != null
        ? encodeURIComponent(filename)
        : value instanceof Blob
        ? "blob"
        : value instanceof File
        ? encodeURIComponent(value.name)
        : null

    const contentDisposition =
      partFilename == null
        ? `form-data; name="${name}"`
        : `form-data; name="${name}"; filename="${partFilename}"`

    partHeaders.push(`Content-Disposition: ${contentDisposition}`)

    // File is subclass of Blob so we don't need to check for that.
    // Also if `ContentType` header is passed no need for adding one.
    if (value instanceof Blob && headers["Content-Type"] == null) {
      partHeaders.push(`Content-Type: ${value.type}`)
    }

    for (const [name, value] of Object.entries(headers)) {
      partHeaders.push(`${name}: ${value}`)
    }

    // blank line between the part headers and the part body
    this.parts.push(new Blob([partHeaders.join("\r\n"), "\r\n\r\n", value]))
  }
  toBlob(boundary = `-----------------------------${nanoid()}`) {
    const chunks = []
    for (const part of this.parts) {
      // each part (headers + blank line + content) goes between boundaries
      chunks.push(`--${boundary}\r\n`, part, '\r\n')
    }
    chunks.push(`--${boundary}--\r\n`)
    return new Blob(chunks)
  }
}

With that, the earlier example could be updated to include custom headers with parts:

var data = new FormDataWriter()
let dir = new File([], 'folderName', {
  lastModified: new Date('1995-12-17T03:24:00'),
  type: 'x-directory'
})

let file = new File(['could uploded file'], 'demo.txt')

data.append('file', dir)
data.append('file', file, 'folderName/demo.txt', {
  mode: '0666',
  mtime: file.lastModified,
})

let boundary = `-----------------------------${nanoid()}`
fetch('/post', {
  method: "POST",
  body: data.toBlob(boundary),
  headers: {
    'Content-Type': `multipart/form-data; boundary=${boundary}`
  }
})

Which produces the following request body:

-------------------------------UFeaHa0FIdCE_8YhUw-5T
Content-Disposition: form-data; name="file"; filename="blob"
Content-Type: x-directory

-------------------------------UFeaHa0FIdCE_8YhUw-5T
Content-Disposition: form-data; name="file"; filename="folderName%2Fdemo.txt"
Content-Type: 
mode: 0666
mtime: 1589225417186

could uploded file
-------------------------------UFeaHa0FIdCE_8YhUw-5T--

However, there is a major downside to this approach:

  • It implies that any user of the IPFS HTTP API needs to import this or another custom FormData encoder to use the API. That is why I think it would be better to take the limitations of the web platform into account and make the HTTP API compatible with it (as proposed in the previous comment, or possibly something else).

@lidel
Member

lidel commented May 26, 2020

I decided to write this here instead of in #3033, as this seems more on topic (streaming in the browser during HTTP POST).

We use ipfs.add in both ipfs-companion and ipfs-webui (cc @rafaelramalho19).
Both apps use the old API, but there is WIP for switching to the async iterators version.

The problem of ipfs-http-client buffering the entire thing (#2863) is still present in the new API and is really painful, as it impacts the initial user experience when interacting with Web UI.

My thoughts on the matter:

  • It is CRITICAL to leverage FormData / File / Blob to facilitate streaming uploads in ipfs-http-client running in the browser
    • as noted by @Gozala above, it seems vendors fixed it and it is now possible to do it
    • ipfs.add and ipfs.files.write should just "do the right thing"
      • @achingbrain what are your thoughts on special-casing inputs with those types and using browser-specific normalization that does not break streaming? Note the idea here is to keep the existing API intact (unlike Discuss: improving ipfs.add #3033, which is about creating something new)
  • We are missing reliable progress reporting
    • right now ipfs.add reports 0% and then jumps to 100% after entire thing is accepted on the other end
  • Additional metadata attributes (mode and mtime) are optional:
    • we call them "unixfsv1.5" internally
    • context: Adding additional file metadata to UnixFSv1 specs#217
    • afaik unixfsv1.5 is not even implemented by go-ipfs (only js-ipfs in Node atm)
    • the way I see it:
      • update core-api docs to indicate that those are optional and opt-in
      • most of the users don't need them – if the choice is between supporting them and streaming upload in browser, we should pick streaming and update docs noting those attributes are supported only in Node for now.

Does this sound sensible?

@Gozala
Contributor Author

Gozala commented May 26, 2020

  • We are missing reliable progress reporting

    • right now ipfs.add reports 0% and then jumps to 100% after entire thing is accepted on the other end

It is still possible to get progress reporting by using the old XMLHttpRequest progress events. The API is not as ergonomic as fetch, but it can send the same data types, although I have not verified that it does not load the whole thing into memory before upload. I'm assuming it will stream, but that needs verification.
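The shape of that approach, as a hedged sketch (the function name and wiring are illustrative, not js-ipfs code): fetch exposes no upload progress events, but XMLHttpRequest does via xhr.upload.

```javascript
// Sketch: POST a body (e.g. FormData) and report upload progress.
// Browser-only - XMLHttpRequest does not exist in Node.
function postWithProgress (url, body, onProgress) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest()
    xhr.open('POST', url)
    // fires repeatedly during the upload, unlike fetch
    xhr.upload.onprogress = (e) => {
      if (e.lengthComputable) onProgress(e.loaded / e.total)
    }
    xhr.onload = () => resolve(xhr.responseText)
    xhr.onerror = reject
    xhr.send(body)
  })
}
```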

@Gozala
Contributor Author

Gozala commented May 26, 2020

Additional metadata attributes (mode and mtime) are optional:

I do not believe we need to choose between metadata and no buffering; my last two comments illustrate how we could get both. That being said, I believe it would require either:

  1. Some changes to the HTTP API so metadata is encoded / provided differently.
  2. Custom FormData encoder implementation that builds up request body as a blob.

It is worth considering the trade-offs between those two. I personally would prefer to keep the client as lean & simple as possible, so that using the HTTP API requires no client library; it's just there for convenience.

@lidel
Member

lidel commented Jun 24, 2020

Breaking changes to HTTP API /api/v0 are something we should avoid, if possible.
(People still struggle with switch to POST-only API, slowing down lib upgrades etc)

If support for opt-in mode and mtime are the only reason to change HTTP API, I'd go as far as to say that we could simply not support those in browser context and nobody would care. AFAIK it is a niche feature used in CLI/daemon contexts.

But the need for a fix is there.
I often stumble upon new prior art of userland support/workarounds for FormData etc:


I believe that, as a bare minimum, we need to improve ipfs-http-client's ipfs.add in browser as part of Pinning Service integration (ipfs/ipfs-gui#91) work in Q3.

Importing files with ipfs-webui (over HTTP API) should be fast and work with files >4GB.
(UC: I want to drag&drop my 4K video into WebUI and then pin it to a remote service)

@lidel lidel added env:browser P1 High: Likely tackled by core team if no one steps up exp/wizard Extensive knowledge (implications, ramifications) required kind/maintenance Work required to avoid breaking changes or harm to project's status quo topic/perf Performance labels Jun 24, 2020
@Gozala
Contributor Author

Gozala commented Jun 24, 2020

Breaking changes to HTTP API /api/v0 are something we should avoid, if possible.
(People still struggle with switch to POST-only API, slowing down lib upgrades etc)

If support for opt-in mode and mtime are the only reason to change HTTP API, I'd go as far as to say that we could simply not support those in browser context and nobody would care. AFAIK it is a niche feature used in CLI/daemon contexts.

I should point out that the changes I mention are not breaking changes; they would just allow passing those options in an alternative way. If users choose not to pass them, or pass them in the old fashion, it should still work.

@Gozala Gozala self-assigned this Jun 24, 2020
@Gozala
Contributor Author

Gozala commented Jun 24, 2020

I will try to drive this effort as my time allows (right now I'm blocked on reviews). With that, I intend to do the following:

  1. Add this item to the agenda of IPFS Core Implementations Weekly meeting.
  2. Schedule a Design Review Proposal
    • We need a decision on what if any changes could be done to HTTP API to make it compatible with browser native FormData.
    • Get a consensus on how ipfs.add / ipfs.file.write should behave here.
    • How do we test this? I'm hesitant to add 4 GB of data to our tests.

As with #3022, I remain very skeptical of keeping the ipfs.add API as-is in the hope that the provided input will allow for the optimizations described. It is just too easy to pass an input that defeats those optimizations.

@achingbrain
Member

Unless I'm missing something, I think the simplest thing to do here is to write a browser-specific version of /packages/ipfs-http-client/src/lib/multipart-request.js that probes the input: if it's a Blob, a { content: Blob }, or a (non-async) iterable of either of them, use the native FormData object as proposed here. If it's not, fall back to the default normalisation. That should cover the drag & drop use case @lidel is interested in, and we can apply further optimisations later down the line as and when necessary.
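The probing idea above could look roughly like this (a sketch only; isNativeFriendly is a hypothetical name, not a js-ipfs function):

```javascript
// Detect inputs that can be handed straight to native FormData:
// a Blob, a { content: Blob } wrapper, or a sync iterable of those.
const isNativeFriendly = (input) => {
  if (typeof input === 'string') return false   // strings are iterable but need normalisation
  if (input instanceof Blob) return true        // File is a subclass of Blob
  if (input && input.content instanceof Blob) return true
  if (input && typeof input[Symbol.iterator] === 'function') {
    return [...input].every(isNativeFriendly)   // non-async iterable of the above
  }
  return false                                  // async iterables etc. fall back
}
```

Anything that fails the probe would go through the existing buffering normalisation.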

How do we test this? I'm hesitant to add 4 GB of data to our tests.

There's always the age-old hack of using it-buffer-stream or something similar to generate buffers filled with 0s: 4 GB of data will be transferred, but IPFS block de-duplication will only write one block to the blockstore, and no data files need to be added to the tests.
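The zero-filled test-data trick can be sketched without any library (zeroes is a hypothetical stand-in for it-buffer-stream, not its API):

```javascript
// Yield zero-filled chunks totalling `total` bytes. The same backing
// array is reused for every yield, so memory stays at one chunk while
// gigabytes flow through - and identical blocks de-dupe on the IPFS side.
function * zeroes (total, chunkSize = 1024 * 1024) {
  const chunk = new Uint8Array(chunkSize) // zero-filled by construction
  for (let sent = 0; sent < total; sent += chunkSize) {
    yield chunk.subarray(0, Math.min(chunkSize, total - sent))
  }
}
```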

@thienpow

thienpow commented Jul 7, 2020

I could get this working in Safari with a near-300 MB video, but it crashes in Chrome, Brave and Firefox.
I think Safari has a ~300 MB buffer limit and the rest of the browsers have a ~150 MB buffer limit.

I think FormData is a must for the browser: all the buffer-stream methods won't work in the browser once the upload reaches 300 MB.
You can see Safari complain about memory when it is buffering the last file, which is 271 MB itself.

https://youtu.be/J4CvwfRpHeA


  const updateProgress = (byteSent, fileName, fileSize) => {
    uploadStatus = uploadStatus + "."
    percent = (byteSent / fileSize) * 100
    //console.log(percent + "%")
  }
  
  const createFile = (file) => {
    return {
      path: file.name,
      // content could be a stream, a url, a Buffer, a File etc
      content: file
    }
  }

  const streamFiles = async (node, files) => {

    const chunkSize = 1048576

    for (const file of files) {
      uploadStatus = uploadStatus + "\n"

      const fileName = file.name
      const fileSize = file.size

      // Create a stream to feed files to ipfs.add
      const stream = new ReadableStream({
        start(controller) {
          controller.enqueue(createFile(file))
          // When we have no more files to add, close the stream
          controller.close()
        }
      }, {
        highWaterMark: chunkSize
      })

      for await (const data of node.add(stream, {
        pin: false,
        chunker: 'size-' + chunkSize,
        wrapWithDirectory: false,
        progress: (byteSent) => updateProgress(byteSent, fileName, fileSize)
      })) {

        //reaching end, will return from there.
        if (data.path.indexOf('/') === -1 && data.path !== '') {
            
          try {

            uploadStatus = uploadStatus + "\n" + data.path + ": " + data.cid

            const src = "/ipfs/" + data.cid
            const dst = "/" + data.path
            await node.files.cp(src, dst)

          } catch (err) {
            //throw Object.assign(new Error('Folder already exists.'), { code: 'ERR_FOLDER_EXISTS' })
          }
        }
      }
    }
    
  }

@lidel
Member

lidel commented Jul 22, 2020

For anyone following along, there are two PRs that aim (among other things) to improve the performance of ipfs.add in ipfs-http-client in a browser context by addressing the findings from this issue in alternative ways:

@Gozala
Contributor Author

Gozala commented Jul 27, 2020

Was fixed by #3184
