Changes 2/24/14
NSF:
- Switched to Google for downloading the zip files (I downloaded the same file from both sources and got 0.26 MB/s from reedtech versus 1.33 MB/s from Google)
- Since Google doesn't put the links in tables, I set the parser to find all links and keep only the ones whose file names start with 'ipa' or 'pa'.
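A minimal sketch of that filtering step, using only the standard-library HTML parser (the actual scraper and the sample page markup here are assumptions, not the real code):

```python
from html.parser import HTMLParser

class ZipLinkParser(HTMLParser):
    """Collect hrefs whose file name starts with 'ipa' or 'pa'."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        name = href.rsplit('/', 1)[-1]  # file name portion of the URL
        if name.startswith(('ipa', 'pa')):
            self.links.append(href)

# hypothetical snippet standing in for the Google bulk-download page
page = '<a href="/bulk/ipa140220.zip">grant</a><a href="/bulk/readme.txt">readme</a>'
parser = ZipLinkParser()
parser.feed(page)
print(parser.links)  # only the 'ipa'/'pa' link survives
```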
- The program now removes the URL of any file already in the temp folder (after verifying that it is a properly written zip file), so it no longer wastes resources parsing files it has already handled.
- I will have to do a bit more testing to make sure it is actually working correctly. I set it to print out every skipped file, and it only reports skipping every other file, even though it appears to skip all of them correctly.
- I realized that I am iterating over a list while removing elements from it. In Java this throws a ConcurrentModificationException, but Python technically allows it and silently skips elements (which would also explain the every-other-file logging above). I will fix this tomorrow: during the scan, collect every element found in both lists into a new list, and once the scan is complete, remove that list's elements from the urls list.
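The bug and the planned fix can be demonstrated with a toy list (the names here are placeholders, not the real data):

```python
# Buggy: removing from a list while iterating over it. After each
# remove() the later elements shift left, but the loop index still
# advances, so every other element is skipped -- no exception raised.
done = {'a.zip', 'b.zip', 'c.zip', 'd.zip'}
urls = ['a.zip', 'b.zip', 'c.zip', 'd.zip']
for url in urls:
    if url in done:
        urls.remove(url)
print(urls)  # ['b.zip', 'd.zip'] -- half the matches survive

# Fix: collect the matches during the scan, then remove them afterwards.
urls = ['a.zip', 'b.zip', 'c.zip', 'd.zip']
to_remove = [url for url in urls if url in done]
for url in to_remove:
    urls.remove(url)
print(urls)  # [] -- all matches removed
```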