Anon 01/13/2024 (Sat) 05:47 No.9326

>>9325
>pmvtoday.com grab died. Will try to install grab-site on a more stable computer
I installed it on the other Linux computer, but it didn't work on that one. I don't feel like fixing python bullshit or whatever is causing it to not work, so I'll just use the crappy unstable computer again, this time ignoring already-downloaded threads. The "ignores" file has 72,778 lines/URLs; hopefully that many regexes don't slow it down too much. More info below.

"./pmvtoday.com-forum-2024-01-11-4c54c2d1/wpull.log" -> vim

:g!/200 OK/d

-> make URL list and save to "1w200_1.txt" (has 104,137 lines) -> keep only thread URLs, so that entry point links like http://pmvtoday.com/forum/archive/index.php/forum-... are not ignored -> vim

:g!/^https\?:..pmvtoday.com.forum.archive.index.php.thread-\d\+.html$/d

(29,378 fewer lines = 72,778 lines) -> ignore all threads which I already downloaded -> vim

:%s/^/^/g | %s/$/$/g

and save to "1w200i1.txt". Running this means it has to download the 25,000+ thread index webpages again, but whatever; maybe that will be faster now if they got cached because of my earlier grab:
>$ utc; grab-site --import-ignores 1w200i1.txt http://pmvtoday.com/forum/ 1>2log1.txt 2>2log2.txt; utc
>2024-01-13T05:21:57.039639903Z
>[ saving to ./warc/009/pmvtoday.com-forum-2024-01-13-b8f5fc74/ ]
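
For anyone who would rather script the vim steps above than do them by hand, here is a rough Python sketch of the same pipeline: keep the "200 OK" lines, pull out the URL, keep only first-page thread URLs, and turn each into an anchored ignore pattern. Assumptions: the URL is the first http(s):// token on each log line (I did not check the exact wpull.log line format), the function names are made up, and it differs from the vim version in two ways: it uses re.escape so dots match literally, and it drops duplicate URLs.

```python
import re

# Assumption: a fetched-page log line contains "200 OK", and the URL is the
# first http(s):// token on that line, possibly wrapped in quotes.
URL_RE = re.compile(r"https?://[^\s'\"‘’]+")

# Same shape as the vim pattern: thread-<number>.html only, no page suffix.
THREAD_RE = re.compile(
    r"^https?://pmvtoday\.com/forum/archive/index\.php/thread-\d+\.html$"
)


def urls_with_200(log_lines):
    """Yield the URL from every log line that reports a 200 OK response."""
    for line in log_lines:
        if "200 OK" not in line:
            continue
        match = URL_RE.search(line)
        if match:
            # Drop trailing punctuation that belongs to the log message.
            yield match.group(0).rstrip(".,:;")


def thread_ignores(urls):
    """Return sorted, de-duplicated anchored regexes for thread URLs."""
    threads = {u for u in urls if THREAD_RE.match(u)}
    return sorted("^" + re.escape(u) + "$" for u in threads)


def write_ignores(log_path, out_path):
    """Read a log file and write one ignore pattern per line."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        patterns = thread_ignores(urls_with_200(log))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(patterns) + "\n")
    return len(patterns)
```

Usage would be something like write_ignores("wpull.log", "1w200i1.txt"), and the result goes to --import-ignores as before. Because of the de-duplication the line count may not match what the vim route gives.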

What's the structure of that site?
* thread #1 http://pmvtoday.com/forum/archive/index.php/thread-1.html
* thread #1 page #3 http://pmvtoday.com/forum/archive/index.php/thread-1-3.html
* thread #10 http://pmvtoday.com/forum/archive/index.php/thread-10.html
* thread #100,200 http://pmvtoday.com/forum/archive/index.php/thread-100200.html
* other URLs
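
The URL shapes listed above can be told apart with one regex. A minimal sketch, assuming those are the only two thread URL forms; parse_thread_url is a made-up name, and the page number is left as None when the URL has no page suffix, since I am only guessing that such a URL is the first page:

```python
import re

# Matches thread-<number>.html and thread-<number>-<page>.html.
THREAD_URL_RE = re.compile(
    r"^https?://pmvtoday\.com/forum/archive/index\.php/"
    r"thread-(\d+)(?:-(\d+))?\.html$"
)


def parse_thread_url(url):
    """Return (thread_number, page_number), or None for "other URLs".

    page_number is None when the URL has no page suffix.
    """
    match = THREAD_URL_RE.match(url)
    if match is None:
        return None
    thread, page = match.groups()
    return int(thread), (int(page) if page is not None else None)
```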

I guess I have all of /forum/archive/index.php/forum- downloaded, so a different approach would be to use no ignores and instead set start_urls to the threads which I didn't download. Concern: later pages of threads might get skipped in some cases, same as with what I'm using now. Possible solution: extract all .html files from the .warc.gz via warcat (https://github.com/chfoo/warcat - this one I think), then "lynx -dump" them and search for "thread-\d+-\d+.html", i.e. thread-[thread_number]-[thread_page].html
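
A rough sketch of the "search for thread-\d+-\d+.html" step in Python, assuming the dumped text contains the links in full and that a list of already-downloaded URLs is at hand (for example the 200 OK list). All names here are made up. It compares (thread, page) pairs rather than full URLs, so http vs https differences don't matter:

```python
import re

BASE = "http://pmvtoday.com/forum/archive/index.php/"
PAGE_LINK_RE = re.compile(r"thread-(\d+)-(\d+)\.html")


def page_links(text):
    """Return the set of (thread, page) pairs mentioned anywhere in text."""
    return {(int(t), int(p)) for t, p in PAGE_LINK_RE.findall(text)}


def missing_pages(dump_text, downloaded_urls):
    """Return sorted (thread, page) pairs that are linked but not downloaded."""
    have = set()
    for url in downloaded_urls:
        have |= page_links(url)
    return sorted(page_links(dump_text) - have)


def to_url(thread, page):
    """Build the thread page URL in the form the site uses."""
    return f"{BASE}thread-{thread}-{page}.html"
```

The URLs from [to_url(t, p) for t, p in missing_pages(text, downloaded)] could then be used as extra start URLs for a follow-up grab.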