Friday, February 28, 2014

Changes 2/28/14

NSF:

  • Figured out the problem with finding the Government Interest statement.  The code was looking for the string '<paragraph-federal-research-statement>', but in practice these tags always carry attributes, e.g. <paragraph-federal-research-statement id=xx...>, so an opening tag was never found.
  • With this fixed, however, new problems showed themselves.  Since the parser code hadn't been run since I split patparser.py into its own file, there are now a ton of missing methods.  I am trying to figure out the class system, but it seems really counter-intuitive.
  • I need a way to call methods from the instance python file (run.py) within patparser.py, but there does not seem to be a way to do this.  In Java, you would just make those methods static, or pass the calling object to the imported class with the this keyword.
  • In Python, there is no this, and seemingly no way to get a reference to the object of the file that instantiated another file.
  • I tried putting stuff into classes, but then you have to add an additional self argument to every single method.  Really?  That's a dumb thing to force on anyone who wants to use objects.  (One workaround I'm considering is sketched below.)
  • No GitHub push because the code is in pieces
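
Roughly what that workaround would look like (a sketch with made-up function names, not the actual code): pass the calling object into patparser's functions explicitly, much like handing over this in Java.

    # patparser.py  (sketch -- these names are hypothetical)
    def find_government_interest(xml_text, caller):
        # Scan the raw xml text for the federal research statement
        start = xml_text.find('<paragraph-federal-research-statement')
        if start == -1:
            return None
        end = xml_text.find('</paragraph-federal-research-statement>', start)
        statement = xml_text[start:end]
        caller.log_statement(statement)   # call back into the run.py object
        return statement

    # run.py  (sketch)
    import patparser

    class Runner(object):
        def log_statement(self, text):
            print(text)

        def scrape(self, xml_text):
            # passing self explicitly is the closest Python gets to Java's this
            patparser.find_government_interest(xml_text, self)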

Thursday, February 27, 2014

Changes 2/27/14

NSF:

  • Verified that reading .breakpoint behaves correctly
  • Realized that I was probably getting low download speeds due to being on Wi-Fi rather than Ethernet
  • Added a check for <patent-application-publication>; the program will now force-stop if this tag is not found anywhere
  • This is temporary behavior that will allow me to pinpoint the date at which the tag changed.
  • Added a sort to the list of urls.  It now sorts from oldest to newest, whereas before the order was based on the page layout.  It does this by using a regex to remove all non-digits from each url, then sorting the resulting strings alphabetically (the date digits all have the same width, so alphabetical order is also chronological order); see the sketch after this list.
  • I still haven't gotten anything logged to the csv file though.  
  • I changed the NSF check slightly: when checking for the string 'nsf', it no longer removes spaces, so a word that merely happens to contain 'nsf', such as 'transfer', does not trigger a false positive.  When checking for 'nationalsciencefoundation', it still ignores spaces.
  • Added output for when the government interest statement is found, but it never outputs, so it is possible that I somehow messed up the tag.  I will look into this tomorrow.
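
The sort described above, roughly (the urls here are placeholders):

    import re

    urls = ['pa030321.zip', 'ipa100415.zip', 'pa020107.zip']

    # Strip everything that isn't a digit, leaving the yymmdd date from the
    # filename; since the date strings all have the same width, sorting them
    # alphabetically also puts them oldest-to-newest.
    urls.sort(key=lambda url: re.sub(r'[^0-9]', '', url))

    print(urls)   # ['pa020107.zip', 'pa030321.zip', 'ipa100415.zip']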

Wednesday, February 26, 2014

Changes 2/26/14

NSF:

  • The download speeds were really slow today, so I didn't get a whole lot done
  • Probably implemented forcing inclusion of the filename within .breakpoint, but I couldn't test it because I couldn't download any of the zip files.

Tuesday, February 25, 2014

Changes 2/25/14

NSF:

  • Fixed the concurrent modification issue from 2/24
  • Added a handler for Ctrl+C (break) to delay breaking until after the current scrape has completed if one is in progress (roughly sketched after this list)
  • Added a function that writes the current filename to a file called .breakpoint if the program breaks while scraping.
  • This will prevent the program from skipping a file that has been downloaded but not yet scraped.
  • I haven't yet implemented reading of .breakpoint, but I plan to have it remove the filename it contains from the remove array in removeDownloaded(), thus keeping that filename in the queue to scrape.
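
A simplified sketch of the idea (not the exact flow; download(), scrape(), and the queue contents are stand-ins):

    import signal

    def download(filename):
        print('downloading ' + filename)      # stand-in for the real download

    def scrape(filename):
        print('scraping ' + filename)         # stand-in for the real scrape

    queue = ['pa140101.zip', 'pa140108.zip']  # stand-in for the real file queue
    break_requested = False

    def handle_break(signum, frame):
        # Don't exit immediately; just remember that Ctrl+C was pressed
        global break_requested
        break_requested = True

    signal.signal(signal.SIGINT, handle_break)

    for filename in queue:
        download(filename)
        if break_requested:
            # The file is downloaded but not yet scraped; record it so the
            # next run keeps it in the queue instead of skipping it
            with open('.breakpoint', 'w') as f:
                f.write(filename)
            break
        scrape(filename)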

Javascript:

  • Took quiz

Monday, February 24, 2014

Changes 2/24/14

NSF:

  • Switched to Google for downloading the zip files (I downloaded a file from both and got 0.26 MB/s from reedtech and 1.33 MB/s from Google)
  • Since Google doesn't have the links in tables, I set the parser to find all links, then keep only the ones that start with 'ipa' or 'pa'.
  • Program now removes urls of any file already in the temp folder (after testing if it is a properly written zip file), causing it to no longer waste resources parsing already parsed files.
  • I will have to do a bit more testing to make sure it is actually working correctly.  I set it to print out all skipped files, but it only prints that it is skipping every other file, even though it appears to correctly skip all of them
  • I realized that I am iterating over a list while removing elements from it.  In Java, this throws a ConcurrentModificationException, but I guess in Python it is technically allowed.  I will fix this tomorrow by collecting the elements found in both lists into a new list and, once the scan is complete, removing everything in that list from the urls list (see the sketch after this list)
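
The planned fix, sketched (list contents are just examples):

    urls = ['pa140101.zip', 'pa140108.zip', 'pa140115.zip']
    downloaded = ['pa140101.zip', 'pa140115.zip']

    # Don't remove from 'urls' while iterating over it; collect matches first
    to_remove = [url for url in urls if url in downloaded]
    for url in to_remove:
        urls.remove(url)

    print(urls)   # ['pa140108.zip']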

Friday, February 21, 2014

Changes 2/21/14

NSF:

  • Added a method to parse the year, month, and day out of the filename (roughly sketched after this list)
  • Learned more regex
  • Set the program to download and scrape everything from 2001, but it didn't finish
  • So far, I've looked through 32 full xml files and found no NSF documents.  I have tested the program against sample files mentioning NSF, but it is still kind of discouraging to not get any results.
  • Also, I need to find copies of the DTD from various years in order to find how the tags changed
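
Roughly how the date parsing works (a sketch, assuming a yymmdd date in names like 'ipa100107.zip'):

    import re

    def parse_date(filename):
        # Pull out just the digits, then split them into year, month, day
        digits = re.sub(r'[^0-9]', '', filename)
        return digits[0:2], digits[2:4], digits[4:6]

    print(parse_date('ipa100107.zip'))   # ('10', '01', '07')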

Thursday, February 20, 2014

Changes 2/20/14

NSF:

  • Learned more about how python modules, classes, and namespaces work
  • Learned how to do the Python equivalent of Java's try{}catch{}
  • Moved all parsing code into patparser.py to clean up a bit
  • Moved unzipping code from the main() method into a separate method 
  • Added a check for the BadZipfile exception; the zip will be redownloaded if this is thrown (sketch after this list)
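
The try/except pattern in question, roughly (redownload() is a placeholder for whatever fetches the file again):

    import zipfile

    def open_zip(path, redownload):
        try:
            return zipfile.ZipFile(path)
        except zipfile.BadZipfile:
            # The local copy is corrupt or incomplete; fetch it again and retry
            redownload(path)
            return zipfile.ZipFile(path)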

Wednesday, February 19, 2014

Changes 2/19/14

NSF:

  • The array storing the xml documents wasn't clearing after completion of a scrape; fixed that
  • Correctly implemented checking for local files before downloading.  The only problem now is that if you break in the middle of a download, the partial file stops the program on the next run as a bad zip file; I'll have to fix that
  • run.py is getting extremely long and hard to navigate, and I am thinking of splitting it into a utilities file, a main file, and an xml parsing file
  • Attempted to fix up console output a bit

Tuesday, February 18, 2014

Changes 2/18/14

NSF:

  • Program now finds the xml anywhere within the directory tree of the zip file
  • Began adding checks to see if xml files have already been downloaded.  If so, the local copy will be used
  • Cleaned up console output by using sys.stdout.write() with sys.stdout.flush() to overwrite a single line instead of filling the entire console with output on new lines (sketched after this list).
  • One problem I am having is that many of the full xml files I am working with do not mention NSF in the entire document, so sometimes it is hard to tell if things work correctly.
  • In order to fix the problem of tags differing as time goes on, I will have to find and look at the DTD changelogs.
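
The single-line progress output works roughly like this (a sketch):

    import sys
    import time

    for i in range(1, 101):
        # '\r' returns the cursor to the start of the line, so each write
        # overwrites the previous one instead of adding a new line
        sys.stdout.write('\rScraping %d of 100' % i)
        sys.stdout.flush()
        time.sleep(0.01)
    sys.stdout.write('\n')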

Friday, February 14, 2014

Changes 2/13/14 and 2/14/14

I was updating Ubuntu Studio, but the update didn't work right and /dev got corrupted or something like that.  I booted into Windows, installed a program that allowed me to mount the Linux partition, and copied my /home/ folder out.  I had to reinstall from a flash drive, and it took me a lot of tries because I didn't put the bootloader on the right partition.

For future reference, for me: the FAT32 partition isn't for the bootloader; it's a Dell thing.  You should also format the Ubuntu partition just in case.  Put the bootloader on /dev/sda and it will work.

NSF:

  • The program now scrapes http://patents.reedtech.com/parbft.php for all the patent urls, and should iterate through them.  It also gives a readout of the amount downloaded for each file.
  • I realized that the newer patent applications don't use <patent-application-publication> to denote the xml, but instead <us-patent-application>, which breaks the program as of now.
  • Also, not all the zip files have the xml in their root directory.  I am trying to figure out a way to iterate through the tree to find the xml file, but os.walk doesn't seem to work on the archive, and I don't really want to unzip the entire zip directory tree (though I might have to).  One possible approach is sketched after this list.
  • Added more convenience methods to clean up the code a bit
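
One way to locate the xml without unzipping everything would be to walk the archive's own member list instead of the filesystem (a sketch of that idea, not necessarily what will end up in the program):

    import zipfile

    def find_xml_member(zip_path):
        with zipfile.ZipFile(zip_path) as z:
            for name in z.namelist():          # every path inside the archive
                if name.lower().endswith('.xml'):
                    return name
        return None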

Wednesday, February 12, 2014

Changes 02/12/14

NSF:

  • Started a new python file that will scrape http://patents.reedtech.com/parbft.php for all the patent zip file urls.  These will eventually be compiled into a list that the main program can iterate through

Visualize Algorithms:

  • Downloaded and set up VAP
  • Messed around with the demo.py program

Tuesday, February 11, 2014

Changes 2/11/14

NSF:

  • Removed lxml from the Federal Research Statement check
  • This means an xml file that does not pertain to the NSF never has to be parsed or turned into an in-memory structure (I am assuming lxml does this)
  • Instead, my program reads any text between strings '<paragraph-federal-research-statement>' and '</paragraph-federal-research-statement>' (if they are found).
  • It then lowercases the text between these two tags and runs a regex on it to remove all spaces and numbers
  • If the strings 'nsf' or 'nationalsciencefoundation' are found, it logs to file.  Otherwise, it skips to the next document with no xml parsing ever occurring.  (The whole check is sketched after this list.)
  • This results in a roughly 80x speedup (before, it took 26.42 seconds to parse the first 100 items; now it takes 0.33 seconds for the same 100 and only 11.55 seconds to run through the entire 2,396-patent file)
  • This incredible speedup is largely due to the fact that the file I chose has no patent applications mentioning NSF, so no xml is ever parsed by lxml, and all the checks are done by string.find().
  • Started adding support for downloading the files directly from the web
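
The pre-check described above, as a rough sketch (the exact normalization is paraphrased):

    import re

    OPEN_TAG = '<paragraph-federal-research-statement>'
    CLOSE_TAG = '</paragraph-federal-research-statement>'

    def mentions_nsf(document_text):
        start = document_text.find(OPEN_TAG)
        end = document_text.find(CLOSE_TAG)
        if start == -1 or end == -1:
            return False                                  # no federal research statement
        statement = document_text[start + len(OPEN_TAG):end].lower()
        normalized = re.sub(r'[ 0-9]', '', statement)     # drop spaces and digits
        return 'nsf' in normalized or 'nationalsciencefoundation' in normalized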

Monday, February 10, 2014

Changes 2/10/14

NSF:

  • Created github page
  • Parser now checks for NSF clause right away, and skips any documents that don't have it (which is 99% of them)
  • Cleaned up console output.  Example: (Scraping 932 of 2396 - (3212 lines). No NSF reference, skipping.)
  • Fixed inefficient memory use by calling unicode() on the parsed strings (see the sketch after this list).  From the bs4 wiki:
    If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
  • I am not the best at git, so I may sometimes upload the wrong thing or not upload at all.  Apologies in advance.
  • I will add in a timer to see how long the parse takes; it seems to take about 20 minutes.  I'll try running it on my desktop at home, which has a much better processor
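
What the memory fix looks like in practice (a sketch using Python 2's unicode(); the markup and names are just illustrative):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<doc><title>Example</title></doc>')
    tag_text = soup.find('title').string   # a NavigableString tied to the whole parse tree
    plain_text = unicode(tag_text)         # a plain unicode string with no reference to the tree
    # keep plain_text around and the soup can be garbage-collected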

Thursday, February 6, 2014

Changes 2/6/14

NSF:

  • Fixed up csv data separation in the output
  • Started implementing support for the real-world bulk datasets

Wednesday, February 5, 2014

Changes 2/5/14

Javascript:

  • Made a header that fades as you scroll down the page
  • It fades in when you mouse over it, and fades back to the previous value when you move your mouse away from it ( using .hover() and .fadeTo() )
  • The one problem is that the banner will still fade when scrolling even when the mouse is over it.  I will fix this tomorrow by adding a mouseover class to the header while the mouse is hovering over it, and not changing the opacity when that class is present.

Monday, February 3, 2014

Changes 2/3/14

Javascript:

  • Looked at the Algorithm Visualizer code
  • Started trying to write jQuery code