Monday, February 10, 2014

Changes 2/10/14

NSF:

  • Created GitHub page
  • Parser now checks for NSF clause right away, and skips any documents that don't have it (which is 99% of them)
  • Cleaned up console output.  Example: (Scraping 932 of 2396 - (3212 lines). No NSF reference, skipping.)
  • Fixed inefficient memory use by calling unicode() on the parsed strings.  From the bs4 documentation:
    If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
  • I am not the best at Git, so I may sometimes upload the wrong thing or not upload at all.  Apologies in advance.
  • I will add a timer to see how long the parse takes.  Right now it seems to take about 20 minutes.  I'll also try running it on my desktop at home, which has a much better processor.
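
For anyone curious what the early NSF check and the planned timer might look like together, here is a rough sketch. It is not the actual parser code: the regex, the function names (has_nsf_reference, scrape_all), and the "Parsing." message are all made up for illustration, and it is written for Python 3 rather than the Python 2 the project currently uses.

```python
import re
import time

# Hypothetical pattern; the real parser's NSF clause check may differ.
NSF_PATTERN = re.compile(r"\bNSF\b|National Science Foundation")


def has_nsf_reference(text):
    """Cheap pre-check that runs before any expensive parsing."""
    return NSF_PATTERN.search(text) is not None


def scrape_all(documents, parse):
    """Call parse() only on documents that mention NSF and time the whole run."""
    start = time.perf_counter()
    results = []
    total = len(documents)
    for i, text in enumerate(documents, start=1):
        lines = text.count("\n") + 1
        if not has_nsf_reference(text):
            # Skip right away so the slow parse never runs on this document.
            print(f"Scraping {i} of {total} - ({lines} lines). No NSF reference, skipping.")
            continue
        print(f"Scraping {i} of {total} - ({lines} lines). Parsing.")
        results.append(parse(text))
    elapsed = time.perf_counter() - start
    print(f"Parsed {len(results)} of {total} documents in {elapsed:.2f} seconds.")
    return results, elapsed
```

Something like `results, elapsed = scrape_all(docs, my_parse)` would then give both the parsed output and the wall-clock time, which makes it easy to compare the laptop against the desktop.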

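The memory fix above comes down to one conversion call. Here is a minimal sketch of the idea, assuming bs4 is installed; the HTML snippet is a made-up sample, not one of the real documents. Note that this sketch targets Python 3, where unicode() no longer exists and str() does the same job.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up sample document, for illustration only.
html = "<html><body><p>Supported by NSF.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.p.string   # NavigableString, still linked to the parse tree
plain = str(raw)      # plain Python string with no link back to the tree

print(type(raw).__name__, type(plain).__name__)
```

Storing `plain` instead of `raw` means the parse tree can be garbage collected once the soup object goes out of scope.
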