There is a booru-like website that i want to archive, how do i crawl all pages and scrape the images?
There are some youtube channels that i want to follow, how do i periodically get a list of videos made by a channel and then automatically download newly uploaded videos?
There are some imageboards that i want to archive, how do i periodically crawl a board and download threads and images?
Do i just scrape the HTML, parse it, and then download the images/videos?
Best way to do it programmatically? Best practices? Any advice?
If it's just URLs you are looking for, a simple regex and parsing the list afterwards can do the job.
>Youtube
Channels have an RSS feed that you can read periodically. It's how you can emulate being subscribed as well.
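In case it helps, here's a rough sketch of polling such a feed with just the Python standard library. The feed URL pattern and the CHANNEL_ID placeholder are my assumptions, so double-check them against a real channel page.
[code]
# Rough sketch: poll a YouTube channel's Atom feed and list its videos.
# The feed URL pattern and CHANNEL_ID are assumptions -- verify against a real channel.
import urllib.request
import xml.etree.ElementTree as ET

CHANNEL_ID = "UCxxxxxxxxxxxxxxxxxxxxxx"  # hypothetical placeholder
FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id=" + CHANNEL_ID
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(FEED_URL) as resp:
    root = ET.fromstring(resp.read())

for entry in root.iter(ATOM + "entry"):
    title = entry.findtext(ATOM + "title")
    link = entry.find(ATOM + "link").attrib["href"]
    print(title, "->", link)
[/code]
Run something like that from cron and hand any link you haven't seen before to youtube-dl, and you've basically emulated subscriptions.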
If it's anything particularly complicated, where you actually have to parse HTML, I'd go for Python/requests/BeautifulSoup.
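Something like this as a rough starting point (the site URL and the assumption that the images sit in plain <img> tags are made up; every booru lays its pages out differently):
[code]
# Rough requests + BeautifulSoup sketch: pull image URLs off one gallery page.
# BASE_URL and the plain <img> selector are placeholders -- adjust per site.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://example-booru.example/posts?page=1"  # hypothetical

resp = requests.get(BASE_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

image_urls = [urljoin(BASE_URL, img["src"])
              for img in soup.find_all("img") if img.get("src")]

for url in image_urls:
    filename = url.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(requests.get(url, timeout=30).content)
[/code]
Crawling all pages is then just a loop over the ?page= parameter with a time.sleep() between requests so you don't hammer the server.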
>Newsboat
>gallery-dl
>NewPipe
>youtube-dl+mpv (was already using this setup)
Thanks for all the suggestions.
>youtube RSS
Didn't know about it.
>>7947 >If it's just URLs you are looking for, a simple regex and parsing the list afterwards can do the job.
Can i ask what you mean by that exactly?
>I'd go for Python/requests/BeautifulSoup.
Ok, i'll look into that.
I've never really programmed in Python though. Does anybody know of a similar library, stack, or method in C or C++?
I'll go with this suggestion otherwise.
>>7977 Not him, but if you wanted to scrape images off a site you could curl the site, grep for the image tags, and use cut to get the urls. Once you have a list of urls you can use it as an input file for wget to download all the images.
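The same pipeline in Python, if you'd rather not chain shell tools, looks roughly like this (the page URL is a placeholder):
[code]
# Python equivalent of curl | grep | cut: fetch a page, pull out <img> src urls,
# and write one per line to urls.txt so it can be fed to `wget -i urls.txt`.
import re
import urllib.request
from urllib.parse import urljoin

PAGE = "https://example-booru.example/posts?page=1"  # hypothetical

with urllib.request.urlopen(PAGE) as resp:
    html = resp.read().decode("utf-8", errors="replace")

srcs = re.findall(r'<img[^>]+src="([^"]+)"', html)

with open("urls.txt", "w") as f:
    for src in srcs:
        f.write(urljoin(PAGE, src) + "\n")
[/code]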
>C or C++
I would not recommend going that route. The tooling just isn't really there, so you will have to do a lot of the work yourself. Besides, the script is going to be fairly simple, so you can quickly pick up the Python you need to write it.
>>7977 >Can i ask what do you mean exactly?
All urls follow a similar pattern that is easy to write a regular expression for: it's always http(s):// followed by a run of URL characters (letters, digits, and things like . / ? = & % _ -). So, instead of turning the page into a DOM tree and then traversing it, you can simply match your expression against the entire thing and get the URLs. Keep in mind that it may not work on video files on a lot of popular websites, because they now use blob: URLs, which in this particular case means that there is a JS function somewhere that feeds the player an .m3u8 playlist instead. gallery-dl seems to use youtube-dl for videos, so it will probably work for this.
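A quick sketch of that approach; the character class here is my own rough guess at "good enough", nothing authoritative:
[code]
# Sketch of the regex approach: match URL-looking runs in the raw HTML instead
# of building a DOM tree. The character class is a rough guess, not exhaustive.
import re

URL_RE = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+")

with open("page.html", encoding="utf-8", errors="replace") as f:
    html = f.read()

urls = sorted(set(URL_RE.findall(html)))
image_urls = [u for u in urls
              if u.lower().endswith((".jpg", ".jpeg", ".png", ".gif", ".webm"))]

for u in image_urls:
    print(u)
[/code]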
>I never really programmed in Python thou
It's babby script, if you come from C/C++ it shouldn't take you too much time to learn enough to implement this. Fuck indentation sensitivity though. Writing throwaway things like this works better in a higher level language.
If you really want to do it, C has Gumbo for HTML parsing. For requests you can just go with libcURL.
>>7980 >It's babby script, if you come from C/C++ it shouldn't take you too much time to learn enough to implement this. Fuck indentation sensitivity though. Writing throwaway things like this works better in a higher level language.
When I was a kid I wrote a crawler with C++ using libcurl, boost regex and pugixml for parsing and sqlite for keeping track of download history. It was good for getting better at C++ especially when I made it multithreaded to use multiple Tor circuits in parallel but it's painful and awkward.
When I stopped caring about improving my C++ I switched to Ruby and nokogiri [spoiler]fuck indentation sensitivity[/spoiler] which was much nicer.
But writing xpath expressions for new sites gets tedious, because it takes a lot of trial and error to get them right, and then there end up being stupid exceptions you don't run into until the crawler has been running for a few weeks.
These days I just use curl and grep with bash to glue it together and a text file to keep track of download history.
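The text-file history is easy to replicate inside a Python script too; a sketch (file names are arbitrary placeholders):
[code]
# Sketch of a text-file download history: skip urls already recorded, append
# newly downloaded ones. File names here are arbitrary placeholders.
import os
import urllib.request

HISTORY_FILE = "downloaded.txt"

def load_history():
    if not os.path.exists(HISTORY_FILE):
        return set()
    with open(HISTORY_FILE, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def download_new(urls):
    seen = load_history()
    with open(HISTORY_FILE, "a", encoding="utf-8") as hist:
        for url in urls:
            if url in seen:
                continue  # fetched earlier (or deliberately deleted) -- skip it
            filename = url.rsplit("/", 1)[-1] or "index.html"
            urllib.request.urlretrieve(url, filename)
            hist.write(url + "\n")
            seen.add(url)
[/code]
Since the history only ever grows, deleting a file locally doesn't remove its url from the list, so it won't get re-downloaded the next time the crawler runs.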
>>7963 >gallery-dl
This looks awesome. Does it have support for history though? I mean, you never really run a crawler only once: you want to run it continuously to get new posts, and I want to be able to delete files I don't care about without them getting re-downloaded the next time the crawler runs. Is it sophisticated enough for that?