Monday, February 24, 2014

Changes 2/24/14

NSF:

  • Switched to Google for downloading the zip files. I downloaded a file from each and got 0.26 MB/s from reedtech and 1.33 MB/s from Google.
  • Since Google doesn't put the links in tables, I set the parser to find all links and keep only the ones that start with 'ipa' or 'pa'.
  • The program now removes the URL of any file already in the temp folder (after testing that it is a properly written zip file), so it no longer wastes resources parsing already-parsed files.
  • I will have to do a bit more testing to make sure it is actually working correctly. I set it to print out all skipped files, but it only prints that it is skipping every other file, even though it appears to correctly skip all of them.
  • I realized that I am iterating through a list while removing elements from it. In Java, this throws a ConcurrentModificationException, but in Python it is allowed: the loop just silently skips the element after each removal, which may be why only every other skip gets printed. I will fix this tomorrow by adding every element found in both lists to a new list; once the scan is complete, the elements in that list will be removed from the urls list.
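
Here is a rough sketch of how the steps above could fit together in Python 3. It is not the actual program: the names `LinkCollector`, `patent_zip_links`, `is_valid_zip`, and `remaining_urls` are made up for illustration, and it assumes the local file name is simply the last path segment of the URL. The first few lines also show what removing from a list while looping over it does in Python.

```python
import os
import zipfile
from html.parser import HTMLParser

# Removing while iterating: no exception, but every other element is missed.
items = [1, 2, 3, 4]
for item in items:
    items.remove(item)
print(items)  # [2, 4]


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whether or not it is in a table."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def file_name(url):
    # Assumption: the file name is whatever follows the last slash.
    return url.rsplit("/", 1)[-1]


def patent_zip_links(html):
    """Find all links, then keep the ones whose file name starts with 'ipa' or 'pa'."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links
            if file_name(link).startswith(("ipa", "pa"))]


def is_valid_zip(path):
    """True if path opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def remaining_urls(urls, temp_dir):
    """Return the urls that still need downloading, leaving urls untouched."""
    already_done = []
    for url in urls:  # urls is never modified during this scan
        local_path = os.path.join(temp_dir, file_name(url))
        if os.path.isfile(local_path) and is_valid_zip(local_path):
            print("skipping", url)
            already_done.append(url)
    # Only after the scan is complete are the collected entries dropped.
    return [url for url in urls if url not in already_done]
```

Collecting the matches first and filtering afterwards means the loop sees every URL exactly once, so each skipped file should be printed.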
