
[BUG] build() sometimes returns a bunch of URLs, sometimes not #616

Open
MrsBookik opened this issue Feb 25, 2024 · 1 comment
Labels
bug Something isn't working


MrsBookik commented Feb 25, 2024

Describe the bug

build() does not appear to be reliable. Sometimes it does what you would expect, sometimes not. Sometimes it returns tons of article URLs, and a few seconds later none, given the same input parameter.
When I scrape cnn.com or edition.cnn.com, which are used in the official examples, it consistently returns different results on subsequent calls.
On the first attempt, it returns hundreds of article URLs. On the second call only 5, on the third only 2, and from then on zero results.
Waiting a day and starting from scratch reproduces the same pattern.

To Reproduce
Steps to reproduce the behavior, please post any code you used and the website you tried to parse/process:

import newspaper

cnn_paper = newspaper.Source('https://cnn.com')

print(cnn_paper.size())  # no articles, we have not built the source

cnn_paper.build()
print(cnn_paper.article_urls())

print(cnn_paper.size())


Expected behavior

build() should always return the article URLs listed on the cnn.com site.

Screenshots

System information

  • OS: [Windows / Linux / Macos]
  • Python version [e.g. 3.6, 3.9]
  • Library version [e.g. 0.9.0]


@MrsBookik MrsBookik added the bug Something isn't working label Feb 25, 2024
@AndyTheFactory
Owner

Hi,
can you try again with the new release (0.9.3)?

Also, make sure you test with memorize_articles=False.

https://newspaper4k.readthedocs.io/en/latest/user_guide/api_reference.html?highlight=memorize_#newspaper.configuration.Configuration.memorize_articles

Just set it in the config, or pass it as a parameter to the Source constructor.
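A minimal sketch of that suggestion, assuming newspaper4k 0.9.3 or later is installed and that the shrinking result counts come from article memorization (already-seen URLs being filtered out of later builds), which this thread has not confirmed. The helper name `count_article_urls` and its `runs` parameter are hypothetical and only serve the illustration; `Configuration.memorize_articles` is the option from the linked API reference.

```python
def count_article_urls(url, runs=2):
    """Build the same source several times and return the URL count per run.

    Hypothetical helper for illustration; not part of newspaper4k.
    """
    # Imported here so the file still loads when newspaper4k is not installed.
    import newspaper
    from newspaper.configuration import Configuration

    config = Configuration()
    # Assumption: with memorization off, URLs seen in earlier builds are
    # no longer filtered out of later ones.
    config.memorize_articles = False

    counts = []
    for _ in range(runs):
        source = newspaper.Source(url, config=config)
        source.build()
        counts.append(len(source.article_urls()))
    return counts


if __name__ == "__main__":
    # Requires network access; the counts depend on the live site.
    print(count_article_urls("https://cnn.com"))
```

If the counts stay roughly the same across runs with this setting, memorization was the likely cause. Going by the comment above, passing the option directly, as in `newspaper.Source('https://cnn.com', memorize_articles=False)`, should have the same effect.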
