Changes 2/11/14
NSF:
- Removed lxml from the Federal Research Statement check
- This means XML files that do not pertain to the NSF no longer need to be parsed into an in-memory structure (I am assuming that is what lxml does)
- Instead, my program reads any text between strings '<paragraph-federal-research-statement>' and '</paragraph-federal-research-statement>' (if they are found).
- It then runs a regex on the text between these two tags, converting everything to lower case and removing all spaces and numbers
- If the strings 'nsf' or 'nationalsciencefoundation' are found, it logs to file. Otherwise, it skips to the next file with no XML parsing ever occurring.
- This results in a speedup of about 80 times: before, it took 26.42 seconds to parse the first 100 items; now it takes 0.33 seconds to parse the first 100 and only 11.55 seconds to run the entire 2,396-patent file
- This incredible speedup is largely due to the fact that the file I chose has no patent applications mentioning the NSF, so no XML is ever parsed by lxml, and all the checks are done by string.find().
- Started adding support for downloading the files directly from the web
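
A rough sketch of the NSF pre-filter described above, written for Python 3. The function and constant names (federal_statement, mentions_nsf, OPEN_TAG, CLOSE_TAG) are illustrative and not taken from the actual program, and the regex here strips all whitespace rather than only spaces, which is an assumption on my part.

```python
import re

OPEN_TAG = '<paragraph-federal-research-statement>'
CLOSE_TAG = '</paragraph-federal-research-statement>'

# Matches runs of whitespace and digits so they can be removed in one pass.
# (Assumption: the original removes "spaces and numbers"; \s also covers
# tabs and newlines.)
_STRIP = re.compile(r'[\s\d]+')


def federal_statement(doc):
    """Return the raw text between the two tags, or None if either is missing."""
    start = doc.find(OPEN_TAG)
    if start == -1:
        return None
    start += len(OPEN_TAG)
    end = doc.find(CLOSE_TAG, start)
    if end == -1:
        return None
    return doc[start:end]


def mentions_nsf(doc):
    """Cheap string check that runs without building an XML tree."""
    statement = federal_statement(doc)
    if statement is None:
        return False
    normalized = _STRIP.sub('', statement.lower())
    return 'nsf' in normalized or 'nationalsciencefoundation' in normalized
```

Only documents for which mentions_nsf() returns True would need any further handling. Note that a plain substring test for 'nsf' also matches inside words such as "transfer", so a hit from a check like this is best treated as a candidate rather than a confirmed NSF mention.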