
Add option to download inline links directly after their parent page when downloading recursively #2

Open · wants to merge 2 commits into master
Conversation

john-peterson (Member)

because it's more likely to download temporary links before they expire when the download order is closer to the browsing experience

differences from the browsing experience cause problems

the problem is described in #1

discarded LIFO solution

a convoluted LIFO solution with a similar result (the difference is that LIFO results in a bottom-to-top download order) is in #1

non-inline links aren't prepended

ATTR_HTML links are never prepended and might point to non-HTML files that aren't downloaded directly after their parent page even with the browser queue type

if this causes a problem, an option to read the link's Content-Type header before enqueuing could be added
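
for illustration, a minimal sketch of that idea (not part of this patch): read the Content-Type header with a HEAD request before enqueuing, and prepend only confirmed non-HTML links. the queue here is a Python stand-in for wget's URL queue, and the function name is hypothetical

    from collections import deque
    from urllib.request import Request, urlopen

    queue = deque()

    def enqueue_by_content_type(url):
        # HEAD request: read the headers without downloading the body
        with urlopen(Request(url, method="HEAD")) as resp:
            content_type = resp.headers.get_content_type()
        if content_type == "text/html":
            queue.append(url)       # HTML: download after the current level
        else:
            queue.appendleft(url)   # non-HTML: download right after its parent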

test

testenv

this checks the link order

./Test--spider-r-browser.py

Test Passed.

changing browser to fifo in Test--spider-r-browser.py fails the test

./Test--spider-r-browser.py

Expected: [['HEAD /', 'GET /', 'GET /robots.txt', 'HEAD /image.svg', 'HEAD /secondpage.html', 'GET /secondpage.html', 'HEAD /nonexistent', 'HEAD /thirdpage.html', 'GET /thirdpage.html', 'HEAD /dummy.txt', 'HEAD /againnonexistent']]
Got: [['HEAD /', 'GET /', 'GET /robots.txt', 'HEAD /secondpage.html', 'GET /secondpage.html', 'HEAD /image.svg', 'HEAD /nonexistent', 'HEAD /thirdpage.html', 'GET /thirdpage.html', 'HEAD /dummy.txt', 'HEAD /againnonexistent']]
Error: Not all files were crawled correctly..
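
the ordered comparison matters: per the commit message at the bottom of this thread, the test originally used symmetric_difference, which ignores element order. a small illustration (request strings shortened from the output above) shows that a set comparison reports no difference between the two orders, while plain list equality catches the mismatch

    expected = ["HEAD /", "GET /", "HEAD /image.svg", "HEAD /secondpage.html"]
    got      = ["HEAD /", "GET /", "HEAD /secondpage.html", "HEAD /image.svg"]

    print(set(expected).symmetric_difference(got))  # set(): no difference found
    print(expected == got)                          # False: the wrong order is caught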

download order

this shows the download order for the test page in #1 /test

the current code sometimes downloads links long after their parent page (see #1 /test)

this patch downloads links directly after their parent page

wget -vdrp -nd --queue-type=browser http://localhost/code/html/test/download/i.html 2>&1 | egrep "^Appending|^Prepending|Dequeuing|Saving to"

Appending http://localhost/code/html/test/download/i.html at depth 0
Dequeuing http://localhost/code/html/test/download/i.html at depth 0
Saving to: ‘i.html’
Appending http://localhost/code/html/test/download/a/a.html at depth 1
Prepending http://localhost/code/html/test/download/x.jpg at depth 1
Appending http://localhost/code/html/test/download/b/b.html at depth 1
Prepending http://localhost/code/html/test/download/y.jpg at depth 1
Dequeuing http://localhost/code/html/test/download/y.jpg at depth 1
Saving to: ‘y.jpg’
Dequeuing http://localhost/code/html/test/download/x.jpg at depth 1
Saving to: ‘x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a.html at depth 1
Saving to: ‘a.html’
Appending http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Prepending http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Appending http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Prepending http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/a/a-y.jpg at depth 2
Saving to: ‘a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a-x.jpg at depth 2
Saving to: ‘a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b.html at depth 1
Saving to: ‘b.html’
Appending http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Prepending http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Appending http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Prepending http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Dequeuing http://localhost/code/html/test/download/b/b-y.jpg at depth 2
Saving to: ‘b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b-x.jpg at depth 2
Saving to: ‘b-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a.html at depth 2
Saving to: ‘a-a.html’
Prepending http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/a/a-a-y.jpg at depth 3
Saving to: ‘a-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/a/a-a-x.jpg at depth 3
Saving to: ‘a-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b.html at depth 2
Saving to: ‘a-b.html’
Prepending http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/a/b/a-b-y.jpg at depth 3
Saving to: ‘a-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/a/b/a-b-x.jpg at depth 3
Saving to: ‘a-b-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a.html at depth 2
Saving to: ‘b-a.html’
Prepending http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/a/b-a-y.jpg at depth 3
Saving to: ‘b-a-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/a/b-a-x.jpg at depth 3
Saving to: ‘b-a-x.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b.html at depth 2
Saving to: ‘b-b.html’
Prepending http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Prepending http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Dequeuing http://localhost/code/html/test/download/b/b/b-b-y.jpg at depth 3
Saving to: ‘b-b-y.jpg’
Dequeuing http://localhost/code/html/test/download/b/b/b-b-x.jpg at depth 3
Saving to: ‘b-b-x.jpg’
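
a toy model of the queue behaviour in this log (a simplification, not wget code): HTML links are appended, inline links are prepended, and dequeuing takes from the front. CHILDREN is a hypothetical miniature of the #1 /test page, listing (link, is_html) pairs in document order

    from collections import deque

    CHILDREN = {
        "i.html": [("a/a.html", True), ("x.jpg", False),
                   ("b/b.html", True), ("y.jpg", False)],
    }

    queue = deque(["i.html"])
    while queue:
        url = queue.popleft()                # Dequeuing
        print("download", url)
        for child, is_html in CHILDREN.get(url, []):
            if is_html:
                queue.append(child)          # Appending
            else:
                queue.appendleft(child)      # Prepending

this prints i.html, y.jpg, x.jpg, a/a.html, b/b.html, matching the depth 0 and depth 1 order in the log above: y.jpg was prepended last, so it's dequeued first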

@john-peterson (Member Author)

prepend non-HTML links

I suggest
    while (1)
        url_dequeue

        if (descend)
            for (; child; child = child->next)
                if (child is HTML)
                    url_enqueue_append
                else
                    url_enqueue_prepend

ok, i get it. that's a better solution. i pushed it to #2

bandwidth

BTW, since you are talking about timing, people with different bandwidths will/might experience different results when viewing such a page. I use mobile internet with 2G/3G <= 64kbit/s. I wonder how that works ;-)

low bandwidth makes this patch more important because low bandwidth increases the time between parent and child downloads

and most pages should support low-bandwidth browsing. e.g. imagevenue probably has a long enough timeout to allow img.php to load its image over low bandwidth too

@john-peterson (Member Author)

git patch

in order to add your patch to wget, could you please send the complete patch in git's format-patch format?

it's in https://github.com/mirror/wget/pull/2.patch

@john-peterson (Member Author)

make it optional

What about the option to switch the behaviour + docs.

i can add that. what do you want to call the option?

@john-peterson (Member Author)

make it optional

you already have that in your 'recurse' branch. Take that, undo the changes to src/recur.c and take your changes from 'recurse2' branch.

ok, it's done: https://github.com/mirror/wget/pull/2/files

debug message

you would also have to add the relevant debug statements to ensure that when we look at a debug log, we know exactly what is happening. The current DEBUGP message will have to be changed to reflect the new working.

i changed the DEBUGP message that says Enqueuing to Appending/Prepending to tell what's happening

test

I would like to see a test case that shows how the two differ.

there's a test in #2 (comment) /test that compares them

i release the copyright

You will be required to sign the copyright assignment agreement with the FSF before we can merge this patch into our codebase.

sounds like a lot of work. i release the copyright, i don't care

@john-peterson john-peterson changed the title Prepend non-HTML links when downloading recursively Add option to prepend non-HTML links when downloading recursively Jan 28, 2015
@john-peterson john-peterson changed the title Add option to prepend non-HTML links when downloading recursively Making all in testenv Jan 28, 2015
@john-peterson john-peterson changed the title Making all in testenv Add option to download non-HTML links directly after their parent page when downloading recursively Jan 28, 2015
@john-peterson john-peterson changed the title Add option to download non-HTML links directly after their parent page when downloading recursively Add option to download links directly after their parent page when downloading recursively Jan 28, 2015
@john-peterson (Member Author)

git patch

Please share the patch directly with us, in git format-patch style.

it's in https://github.com/mirror/wget/pull/2.patch

testenv

No, I meant a live test for the Wget Test Suite. You'll find the test suite in the testenv/ directory. We need to add a test that ensures that the Wget downloads files in the correct order as expected.

i added a test that fails unless the browser queue order is used
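
the real test lives in testenv/. as a standalone sketch of the same idea (not the actual testenv harness; the page names, the expected sequence, and the --queue-type=browser flag follow this PR, and it assumes a wget built with this patch): serve a small page, record the request order, and compare it as an ordered list

    import subprocess
    import threading
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGES = {
        "/": b'<a href="/secondpage.html">a</a><img src="/image.svg">',
        "/secondpage.html": b"<html></html>",
        "/image.svg": b"<svg xmlns='http://www.w3.org/2000/svg'/>",
    }
    seen = []

    class Handler(BaseHTTPRequestHandler):
        def _serve(self):
            seen.append("%s %s" % (self.command, self.path))
            body = PAGES.get(self.path)
            if body is None:
                self.send_error(404)
                return
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            if self.command == "GET":
                self.wfile.write(body)
        do_GET = do_HEAD = _serve
        def log_message(self, *args):
            pass  # keep the server output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]

    subprocess.run(["wget", "-e", "robots=off", "--spider", "-r",
                    "--queue-type=browser", url], check=False)
    server.shutdown()

    # the inline image.svg must come directly after its parent page,
    # before secondpage.html is fetched
    expected = ["HEAD /", "GET /", "HEAD /image.svg",
                "HEAD /secondpage.html", "GET /secondpage.html"]
    assert seen == expected, seen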

@john-peterson (Member Author)

wget it

Please attach the patch to the email you send to this list. NOT a link to the patch. Not all of us use a browser for looking at emails.

you can download it with wget

wget https://github.com/mirror/wget/pull/2.patch

@john-peterson (Member Author)

attach it

Still not going to review the patch. Turns out, I don't know how to use the command you gave me. All I do know is, if the patch were attached to the email, I'd be able to read it.

ok, i attached it

@john-peterson john-peterson changed the title Add option to download links directly after their parent page when downloading recursively Add option to download inline links directly after their parent page when downloading recursively Jan 30, 2015
commits

it's supposed to test the element order but symmetric_difference ignores that

…when downloading recursively

because it's more likely to download temporary inline links before they expire when the download order is closer to the browsing experience