Anon 03/11/2024 (Mon) 09:16 No.9812 del
>>9810
Another method would be something like what ArchiveTeam does: set up a tracker or similar to determine which URLs return 4xx and which are good (redirect or HTTP 200). wpull.log shows HTTP statuses. I didn't see HTTP statuses in grab-site's CDX output, but I did see them in wget/wpull CDX. I'll have to wait hours or days for that site's anti-crawl verification to go away for me, or just solve that crap. At least I know the site does that.
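Sorting URLs into good and bad by status could be scripted from the CDX. A minimal sketch, assuming a space-separated CDX whose first line is a header like " CDX a s g" (field order varies between wget/wpull versions, so the field letters 'a' = original URL and 's' = response code are read from the header rather than hard-coded):

```python
def partition_cdx(lines):
    """Split CDX records into good (2xx/3xx) and bad (4xx/5xx) URLs.

    `lines` is an iterable of CDX text lines; the first line must be
    a ' CDX <field letters>' header. Returns (good, bad) lists of
    (url, status) tuples.
    """
    it = iter(lines)
    header = next(it).split()          # e.g. ['CDX', 'a', 's', 'g']
    fields = header[1:]
    url_i = fields.index('a')          # 'a' = original URL
    status_i = fields.index('s')       # 's' = HTTP response code
    good, bad = [], []
    for line in it:
        parts = line.split()
        if len(parts) != len(fields):  # skip malformed rows
            continue
        url, status = parts[url_i], parts[status_i]
        (good if status.startswith(('2', '3')) else bad).append((url, status))
    return good, bad
```

The bad list could then be fed back into a retry queue, tracker-style.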

Next problem would be getting the outlinks of every non-blocked webpage download. Ideally a website wouldn't block a grab at all; that way the program could run for days and get everything without duplicates (then the remaining problem is additions to the website after the grab). With each program termination you have to work out what was or wasn't downloaded, and, for what was downloaded, whether you got all of its outlinks.
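Outlink collection itself is simple enough to sketch with the Python standard library (this isn't how wpull does it internally, just an illustration): pull every href/src out of a fetched page, resolve it against the page URL, and keep a persistent "seen" set so a restarted grab can skip what it already has.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class OutlinkParser(HTMLParser):
    """Collect absolute outlink URLs from one fetched HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                # resolve relative links against the page's own URL
                self.links.add(urljoin(self.base, value))

def outlinks(page_url, html):
    """Return the set of outlinks found in `html` fetched from `page_url`."""
    parser = OutlinkParser(page_url)
    parser.feed(html)
    return parser.links
```

A crawl loop would diff `outlinks(...)` against the already-downloaded set (e.g. the URLs in the CDX) after every restart, so only genuinely new pages get queued.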

100 GB for that site seems about right; I'd guess it's somewhat smaller.

>>9789
>>9790
From, in part, this folder:
https://web.archive.org/web/20231006071515/https://gateway.pinata.cloud/ipfs/QmWoksjPMhioHyDssntht6rjHaufcPZPWYXGqt2wJ669Yo/0a5fb09/

----

"My Little Pony Season 5 Episode 26-720p.mp4" for 1 day:
https://put.icu/bc7yug2t.mp4