So it looks like I'm going to have to figure out how to use Jupyter Notebook for some of this data science/data visualization work. I think I fixed the index error I was having. I added argparse so I have a little more flexibility when I run the script, and I do a little more cleanup to remove some excess HTML. I'm giving the script another run and trying to scrape 5,000 pages. I have some sleeps in there because I don't want to DDoS them. It's going to take a while to complete, but I'll probably end up with enough data that I can leave the site alone for a while.
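The post doesn't show the actual script, but the argparse-plus-sleep idea looks roughly like this sketch. The flag names (`--pages`, `--delay`) and the `fetch` callback are my own placeholders, not the real ones:

```python
import argparse
import time

def build_parser():
    # Hypothetical flags; the real script may name its options differently.
    p = argparse.ArgumentParser(description="Scrape N pages politely")
    p.add_argument("--pages", type=int, default=5000,
                   help="number of pages to fetch")
    p.add_argument("--delay", type=float, default=1.0,
                   help="seconds to sleep between requests")
    return p

def scrape(pages, delay, fetch):
    # fetch(n) does the actual request; injected so the loop is testable
    # without hitting the network.
    for n in range(1, pages + 1):
        fetch(n)
        time.sleep(delay)  # be polite: don't hammer the server

if __name__ == "__main__":
    args = build_parser().parse_args()
    scrape(args.pages, args.delay, lambda n: print(f"fetching page {n}"))
```

At 5,000 pages with even a one-second delay, the sleeps alone add well over an hour, which is why a run like this ties up the machine for a while.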
Update: around 1,750 pages in, and my worries about memory usage have been put to bed. I've been watching /proc/meminfo and the numbers have held pretty steady. I also noticed that I missed a couple of cases during the HTML sanitization. It's still parsing those pages, but they won't be written out to files. Looks like this is going to tie up my laptop until this evening.
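Eyeballing /proc/meminfo during a long run boils down to parsing its `Key:   value kB` lines. A minimal sketch of that parse (the field names are whatever the kernel reports; nothing here is specific to the scraping script):

```python
def read_meminfo(text):
    # Parse /proc/meminfo-style "Key:   12345 kB" lines into a dict of
    # integer values in kB (unitless fields are kept as-is).
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

# On a Linux box you'd read the real file:
# with open("/proc/meminfo") as f:
#     info = read_meminfo(f.read())
# print(info["MemAvailable"], "kB available")
```

Sampling `MemAvailable` every few minutes and checking that it isn't trending downward is enough to rule out a slow leak in a long-running scraper.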