dev: do not store images in Git #1115

Open
shcheklein opened this issue Jan 28, 2020 · 22 comments
Labels
A: website (Area: website) · p2-nice-to-have (Less of a priority at the moment. We don't usually deal with this immediately.) · status: research (Writing concrete steps for the issue) · type: enhancement (Something is not clear, small updates, improvement suggestions) · website: eng-doc (DEPRECATED JS engine for /doc)

Comments

@shcheklein
Member

shcheklein commented Jan 28, 2020

Cloning takes forever, and at some point we will hit the GitHub limits.


As an experiment, we can see what it would take to integrate DVC to store images and use GraphQL to fetch them. POC here: https://github.com/iterative/blog/pull/93

After that we can amend the Git repo to remove all image objects.

Any other ideas?

@fabiosantoscode
Contributor

@shcheklein where should we store the images? Should we use an S3 bucket with public access and serve them from there?

@shcheklein
Member Author

@fabiosantoscode for the blog.dvc.org specifically? Pick any remote storage to experiment with. We will probably store them on S3 when it's done.

@fabiosantoscode
Contributor

fabiosantoscode commented Jan 28, 2020

@shcheklein I've looked deeply into Gatsby hooks and the source-filesystem plugin (which can generate files from Buffer objects as well as real files), and decided to simplify to the max and work up from there.

So I used dvc import-url to place an example image in a subdirectory of static. I can query it with GraphQL just fine. I placed the DVC files in a single subdirectory, "dvc-img".
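For reference, the kind of command this boils down to (the URL and paths below are made up for illustration):

```sh
# Download an image from a URL into the static tree and track it with a .dvc file.
# The source URL and destination path are hypothetical examples.
dvc import-url https://example.com/uploads/hero.png static/uploads/hero.png
```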

Plus, you don't need to restart gatsby develop when a new image is imported: since the images are in the real filesystem, Gatsby picks them up just like any other images, and therefore they are available from GraphQL.

All that's missing, I think, is running dvc pull using some git hook (using husky), or whenever gatsby starts a build or a dev server.
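As a rough sketch of what such a hook could look like (shown here as a plain Git post-checkout hook; husky would wire up an equivalent script):

```sh
#!/bin/sh
# .git/hooks/post-checkout (sketch): fetch DVC-tracked images after a checkout.
# Fail softly so a missing remote or offline machine doesn't block Git itself.
dvc pull || echo "dvc pull failed; images may be out of date" >&2
```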

Am I on the right track? I don't see too many problems with this approach, and it's pretty simple so it shouldn't be too hard to maintain.

Edit: I'm not tackling image optimization itself since Gatsby has plugins for that. I think it's OK to store large images on S3 and have them optimized at build time.
Edit 2: It might be better to place the DVC files next to the images, but it might be my dislike of complex directory structures talking.

@fabiosantoscode
Contributor

An interesting problem we have now is third-party contributions. If a regular user were to create a post, they would have to give us the images out of band.

<aside>
Sending data over a PR could be a cool problem for iterative and DVC to solve: storing images in some secure iterative-controlled content-addressable storage and opening a PR with a "data PR DVC file" which contains their sha512.

Once the PR is merged, that "data PR DVC file" can be turned into a regular old import-url DVC file by an insider with S3/Google Drive permissions (or a GitHub Action!), who simply runs a command, which downloads it from iterative and places it in the desired remote. It might also help coworkers "branch off" their large data files and then have them imported back into master with ease.

It's still possible for a team to do this by hand by making the files available on some personal HTTP server or public S3 bucket and adding an import-url DVC file with the public S3 URL to the file. It does add some additional steps on the merge side though.
</aside>

@shcheklein
Member Author

So I used dvc import-url to place an example image in a subdirectory of static. I can query it with GraphQL just fine. I place the DVC files in a single subdirectory "dvc-img".

not sure I understand why you would need dvc import-url; why not just dvc add directory-with-images?
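For illustration, the dvc add flow would look something like this (the directory name is hypothetical):

```sh
# Track the whole image directory with a single .dvc file and push the data to the remote.
dvc add static/uploads          # creates static/uploads.dvc and gitignores the data
git add static/uploads.dvc static/.gitignore
git commit -m "Track images with DVC"
dvc push                        # upload the image data to the configured remote
```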

All that's missing, I think, is running dvc pull using some git hook (using husky), or whenever gatsby starts a build or a dev server.

yep, that's the most interesting part: how to automate and hook it nicely into the workflow. Write a plugin that detects that images are stored in DVC and resolves/pulls them programmatically? Also, how do we add more images ... can we make it part of the build (detect changes, then dvc add and dvc push them)?

Sending data over a PR could be a cool problem for iterative and DVC to solve

Agreed! Let's keep this in mind and think about it. But for now we can probably solve this for the internal team first.

@fabiosantoscode
Contributor

@shcheklein dvc add is probably better, yes.

I can create this plugin. If the image DVC files are stored next to the images it's a lot easier (or if the DVC file represents the whole image subdirectory). I can detect changes in the DVC file(s) and run dvc pull automatically. This can detect branch changes and whatnot.

Automatic pull has a drawback though. If you change a DVC file after changing an image, you lose the changes you made. After reading through the docs a bit, I don't think there's a way to pull a DVC file that avoids overwriting novel data that might be in the working directory. It's probably safe to assume people will have an extra copy of the images they're adding, so this is probably fine to ignore for now.

As for writing images, I don't think it would be very good to track images automatically at build time (if you crop, resize or color correct an image 10 times before committing and your dev server is running, we get 9 unused versions of that image sitting on S3).

Instead, simply asking users to run dvc install and making sure it has been installed as a pre-commit check might be enough. If I correctly understand what the hooks do, I think they can effectively make sure that if you're pushing any git changes, changes to data files are pushed as well. If not, we can probably augment them with our own hooks.
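For context, a sketch of what dvc install does: it wires DVC into the repo's Git hooks so data stays in sync with Git operations.

```sh
# Install DVC's Git hooks in the repository:
#   post-checkout -> dvc checkout  (restore data for the checked-out revision)
#   pre-commit    -> dvc status    (warn about out-of-date .dvc files)
#   pre-push      -> dvc push      (upload data before pushing Git changes)
dvc install
```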

@shcheklein
Member Author

or if the DVC file represents the whole image subdirectory

probably this should scale better. DVC is not very well suited for thousands and thousands of DVC-files; it might become a bottleneck. Also, dvc pull can do "granular" partial pulls for a file in the directory.
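For example (hypothetical paths), pulling a single file out of a DVC-tracked directory:

```sh
# Fetch just one image from a directory tracked by a single .dvc file,
# instead of pulling the whole directory.
dvc pull static/uploads/hero.png
```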

Automatic pull has a drawback though. If you change a DVC file after changing an image, you lose the changes you made.

not sure why. You don't change the DVC-file by hand; it's updated by running dvc add or dvc run.

we get 9 unused versions of that image sitting on S3

this is fine, we can do dvc gc from time to time

if you crop, resize or color correct an image 10 times before committing

btw, it's a pretty regular thing in Gatsby that takes a lot of time. Can we use DVC pipelines there to save the prepared/processed images into the cache?

Instead, simply asking users to run dvc install and making sure it has been installed as a pre-commit check, might be enough

we can do any simple option first for the update experience.

@fabiosantoscode
Contributor

The case where you lose your image changes is mostly when you change branches, which changes the DVC file. It shouldn't be a big deal.

Using DVC pipelines sounds like a good idea, but I think there are some gatsby plugins for compressing images which use the gatsby cache to keep things up to date. Either way I don't want to tackle this before the rest of the experience is solid.

Thanks for clearing up my questions, I'll get right on this.

@shcheklein
Member Author

@fabiosantoscode it won't allow you to do dvc checkout if it means losing uncached files.

use the gatsby cache to keep things up to date.

yep, the question is: can Netlify and/or Gatsby CI/CD be set up in a way to use this cache easily?
In the case of DVC, the cache is saved to S3, so even other developers/editors can potentially get it and benefit.

Either way I don't want to tackle this before the rest of the experience is solid.

yep, that's right.

@fabiosantoscode
Contributor

To fix this I think we should cache the .cache and public folders across builds. It's trivial in most CIs but it doesn't seem to be documented on Netlify. They do seem to cache node_modules, so we might just piggy-back on that and copy the cache to and from node_modules/.tmp before and after each build. That might just work. Some Netlify forum discussions mention this repository.
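A minimal sketch of that idea, assuming it runs as the Netlify build command (the .tmp location and the piggy-back trick are assumptions, not an official Netlify feature):

```sh
#!/bin/sh
# Restore Gatsby's cache from node_modules (which Netlify persists), build,
# then stash the fresh cache back for the next build.
mkdir -p node_modules/.tmp
cp -r node_modules/.tmp/.cache .cache  2>/dev/null || true
cp -r node_modules/.tmp/public public  2>/dev/null || true
gatsby build
rm -rf node_modules/.tmp/.cache node_modules/.tmp/public
cp -r .cache node_modules/.tmp/.cache
cp -r public node_modules/.tmp/public
```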

@shcheklein shcheklein transferred this issue from another repository Apr 6, 2020
@shcheklein shcheklein changed the title do not store images in git do not store images in Git Apr 6, 2020
@shcheklein shcheklein added A: website Area: website type: enhancement Something is not clear, small updates, improvement suggestions p1-important Active priorities to deal within next sprints status: research Writing concrete steps for the issue labels Apr 6, 2020
@shcheklein shcheklein changed the title do not store images in Git performance, dev: do not store images in Git Apr 6, 2020
@jorgeorpinel jorgeorpinel changed the title performance, dev: do not store images in Git dev: do not store images in Git Mar 15, 2022
@jorgeorpinel jorgeorpinel added the website: eng-doc DEPRECATED JS engine for /doc label Mar 15, 2022
@jorgeorpinel
Contributor

jorgeorpinel commented Mar 15, 2022

Hi @iterative/websites! I just realized this p1 issue has been waiting for a while. Should we plan it?

Option A - Use DVC (probably with an S3 bucket)
B - Consider Git-LFS since it's integrated with GitHub and we're targeting the specific, narrow use case of large file storage (not really a data management need).
C - ?

Either way it'd be nice to have it set up automatically with yarn for dev envs.

Cc @casperdcl didn't you propose a related solution for another repo? Thanks

@yathomasi
Contributor

Thanks @jorgeorpinel for bringing this up; we need more discussion on this. We have already discussed it a little on Slack and have also mentioned it in the #docs meeting, but haven't reached a conclusion overall.

Option A - Use DVC (probably with an S3 bucket)

  • As @casperdcl mentioned we would need to address this issue:

one problem is making sure the S3 etc. storage that DVC uses is publicly readable but only writable by us. Then the next problem is how do we handle external contributions? How do they write to our storage?

B - Consider Git-LFS since it's integrated to Github and we're aiming to use the specific, narrow use case of large file storage (not really a data management need).

For this we need to be okay with: why-you-shouldn't-use-git-lfs

C - ?

A possible option C, which I mentioned, is using some content management tools (a CMS for content as well as media management):
we have pros and cons and a little discussion on Slack

If we choose to go with one of the above:
Option A:

  • I can collaborate with someone who knows DVC well and submit a PR

Option B:

  • I can submit a PR on this

Option C:

  • If we do consider a CMS, we would need to find a suitable one. Popular headless CMS with Gatsby
  • This one will have other advantages as well but would take some significant development time.

@casperdcl
Contributor

casperdcl commented Mar 15, 2022

propose a related solution for another repo

As I recall I'd also suggested an option of using a submodule (e.g. https://github.com/iterative/static) for images (so that the base repo doesn't bloat in size).

Cons: would require people to shallow-clone/checkout/init submodules, but maybe easiest? From discussion with @shcheklein I also recall the Gatsby build will also have to check out the submodule in order to perform image optimisations.

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 15, 2022

one problem is making sure the S3 etc. storage that DVC uses is publicly readable but only writable by us

Quite doable: you configure 2 remotes to the same storage, the read-only one via HTTP in .dvc/config and the one for writing in .dvc/config.local (needs to be set up manually as part of the dev env setup, which should be documented in the README and/or here).
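A sketch of what that could look like (the remote names and bucket/URL are hypothetical):

```sh
# Committed in .dvc/config: a default, read-only HTTP remote anyone can pull from.
dvc remote add -d images-read https://dvc-public.s3.amazonaws.com/dvc.org/images

# Kept out of Git in .dvc/config.local: a write-capable S3 remote for maintainers.
dvc remote add --local images-write s3://dvc-public/dvc.org/images

dvc pull                     # contributors: read over HTTP, no credentials needed
dvc push -r images-write     # maintainers: upload new images with AWS credentials
```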

Indeed the LFS integration with GH makes this easier for devs though.

how do we handle external contributions? How do they write to our storage?

Good question. It's not a common case though, so maybe not worth worrying about? In the rare cases, they'd have to upload the images to the PR during review and we'd process them appropriately once it's approved, before squash & merge.

Maybe GH/LFS also makes this easier? If so, I'd be inclined toward LFS at this point.

we need to be okay with : why-you-shouldn't-use-git-lfs

We'd want to rewrite the whole repo history to remove image files with either option I think. I don't think it's a problem for a web app -- we never deploy much older versions?

Possible option C was I mentioned using some content management tools(cms

Interesting but how does it integrate with the existing engine? Do we want to manage 2 web apps? Sorry, I can't open the Slack links (feel free to @ me directly on Slack though). BTW we'd still want to rewrite history and remove all images from the repo, I believe.

option of using a submodule

Ugh submodules. A reason not to use Git is that it's "not supposed" to manage binaries anyway, right? The submodule may still take a long time to clone, etc. and complicates things for devs/contributors anyway.

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 15, 2022

Possible option C was I mentioned using some content management tools(cms

Interesting but how does it integrate with the existing engine? Do we want to manage 2 web apps?

💡 idea to explore: (bit hacky but) we could try using our existing Discuss site to upload images and load them from there into the docs. We do something like that for all blog posts to create a forum for comments. I believe the images would be served up by our CDN anyway (we should check).

@yathomasi
Contributor

yathomasi commented Mar 21, 2022

Maybe GH/LFS also makes this easier? If so I'd have to incline for LFS at this point.

Since we deploy our website on Heroku (I couldn't find any mention of support for git-lfs), there might be a problem during build time, so we might need some workaround on Heroku to support git-lfs. But that does come with a limitation.

Shall we try with LFS and submit a PR to see how it goes?

Interesting but how does it integrate with the existing engine?

Currently, the contents are kind of separate too. Gatsby has the flexibility to pull content from different sources.

Do we want to manage 2 web apps?

Depends. We can rely on the cloud services most of them provide.

Possible option C ...

💡 idea to explore: (bit hacky but) we could try using our existing Discuss site to upload images and load them from there into docs...

Or, we can use image/media management services like Cloudinary, which has easy integration with Gatsby too.
There are alternatives as well: Imagekit.io, Imgix.

@iesahin
Contributor

iesahin commented Mar 21, 2022

Even if there is no other option, I'd not use Git-LFS for any solution. It's like using Mercedes trucks in a BMW factory. If we're going to use competitors' solutions, git-annex may be better, as it doesn't require server-side changes 😃

DVC can be used with multiple remote configurations like we do in example repositories. Something like reading from https://dvc-public.s3.amazonaws.com and writing to s3://dvc-public/ might be doable. PRs will still contain images and the final versions will be uploaded by the team to S3 with their credentials.

@yathomasi
Contributor

yathomasi commented Mar 28, 2022

It's just an idea. We need to discuss its feasibility.

  1. Set up a GitHub Action on pull requests
  2. If there are any images in some specified folder, upload them to some image storage implementation
  3. The image storage implementation can be one of
    1. DVC to S3 storage
    2. Some image-as-a-service provider (with an API)
    3. Directly to some S3 bucket (a rough sketch of this step is below)
      1. GitHub Action https://github.com/marketplace/actions/s3-sync
      2. and serve through CloudFront
  4. Once the images are uploaded, create a pull request that replaces the path of the image with a new image path and also completely removes those images (I assume it's possible to use git-filter-branch or git-filter-repo)
  5. The idea of a new pull request is similar to the restyled pull requests we are using.
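A very rough sketch of what steps 2–4 could boil down to in a script the Action runs (the bucket, CDN domain, and folder names are all hypothetical placeholders):

```sh
#!/bin/sh
# Upload new images from the PR to S3, then rewrite Markdown to point at the CDN.
aws s3 sync static/uploads "s3://<images-bucket>/uploads" --size-only
grep -rl '](/uploads/' content/ \
  | xargs sed -i 's|](/uploads/|](https://<cdn-domain>/uploads/|g'
# Stop tracking the binaries in Git going forward; rewriting the existing
# history would additionally need git-filter-repo or similar.
git rm -r --cached static/uploads
```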

cc: @iterative/websites @casperdcl

@iesahin
Contributor

iesahin commented Mar 28, 2022

If we convert dvc.org into a DVC repository, it will be possible to add image files with dvc add and we'll dvc push them to S3 with our credentials.

In the publish phase, a dvc pull in the repository is required to get the files before image transformations. This can use an HTTPS remote so as not to leak S3 credentials to Heroku or GitHub.
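Something like this in the build step, assuming a read-only HTTP remote named "images-read" as in the earlier sketch:

```sh
# Build-time step (sketch): fetch images over HTTP so no S3 credentials
# ever reach the Heroku/Netlify build environment, then build the site.
dvc pull -r images-read
gatsby build
```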

In our PRs, we can upload the files ourselves. Outside contributors won't be able to push, though. They can include the images in PRs, and we can upload them.

Once we have a workflow to use DVC in the repository, we can automate as much as possible.

@casperdcl
Contributor

casperdcl commented Mar 28, 2022

If there are any images on some specified folder, upload them to some image storage implementation

before this step I would have:

  1. cml send-github-check --status in_progress
  2. manual approval (see https://github.com/iterative/cml/blob/f3406ac28cb98aeef5cc47f4754b4c8cac428b14/.github/workflows/test-deploy.yml#L10-L19)
  3. cml send-github-check --conclusion success

Also the storage could be similar to https://assets.cml.dev :)

/CC @iterative/cml

@jorgeorpinel
Contributor

we can use image/media management services like Cloudinary which has easy integration with Gatsby...
Imagekit.io
Imgix

Sounds interesting. I'm also of the idea to try this manually first, and automate via hooks or actions once the workflow is clear.

  1. Once the images are uploaded, create a pull request that replaces the path of the image with a new image path and also completely removes those images

It'd be ideal if the transformation could happen locally though, so that there's no need to cascade PRs, some of which rewrite the history of what was just merged in (sounds fragile). The question about outside contributors is good, but it's not very common that they include images in PRs, so not a pressing one to answer.

@dberenbaum dberenbaum added p2-nice-to-have Less of a priority at the moment. We don't usually deal with this immediately. and removed p1-important Active priorities to deal within next sprints labels Feb 27, 2023
@dberenbaum
Contributor

Lowering to p2 due to lack of resources for now.
