Changes 2/10/14
NSF:
- Created a GitHub page
- Parser now checks for NSF clause right away, and skips any documents that don't have it (which is 99% of them)
- Cleaned up console output. Example: "Scraping 932 of 2396 - (3212 lines). No NSF reference, skipping."
- Fixed inefficient memory use by calling unicode() on the parsed strings. From the bs4 documentation:
    "If you want to use a NavigableString outside of Beautiful Soup,
    you should call unicode() on it to turn it into a normal Python
    Unicode string. If you don’t, your string will carry around a
    reference to the entire Beautiful Soup parse tree, even when you’re
    done using Beautiful Soup. This is a big waste of memory."
- I am not the best at git, so I may sometimes upload the wrong thing or not upload at all. Apologies in advance.
- I will add a timer to see how long the parse takes. Right now it seems to take about 20 minutes. I'll also try running it on my desktop at home, which has a much better processor.
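
A rough sketch of how the early NSF check and the planned timer could fit together. This is not the code from the repo: the function name scrape_all, the NSF_MARKER string, and the exact message wording are assumptions for illustration, and it is written for Python 3 with only the standard library.

```python
import time

# Hypothetical marker: the real NSF clause text is not given in this post.
NSF_MARKER = "National Science Foundation"


def scrape_all(documents, marker=NSF_MARKER):
    """Skip documents that lack the marker and time the whole run.

    Returns a tuple (kept_documents, elapsed_seconds).
    """
    start = time.perf_counter()
    kept = []
    total = len(documents)
    for index, text in enumerate(documents, start=1):
        line_count = len(text.splitlines())
        status = "Scraping %d of %d - (%d lines)." % (index, total, line_count)
        if marker not in text:
            # Cheap substring test first, so the expensive parse never runs
            # on the documents that have no NSF reference.
            print(status + " No NSF reference, skipping.")
            continue
        print(status + " NSF reference found.")
        kept.append(text)  # the real parser would do its work here
    elapsed = time.perf_counter() - start
    print("Finished in %.2f seconds." % elapsed)
    return kept, elapsed


if __name__ == "__main__":
    sample = [
        "Supported by the National Science Foundation.\nSecond line.",
        "Nothing relevant here.",
    ]
    scrape_all(sample)
```

Doing the substring check before any parsing is the point: if most documents get skipped, most of the run is just a string search. time.perf_counter() is used for the timer because it is meant for measuring elapsed intervals.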