Thursday, December 19, 2013

Changes 12/19/13

NSF:

  • (Incomplete) list of fields and their XML tags:
  • Application Number:
    <application-number> 
    <doc-number>10044899</doc-number>
  • Cross Reference to Related Application:
    <cross-reference-to-related-applications>
     ...
  • Inventors:
    <inventors>
    <first-named-inventor>
    <inventor>
    ... 
  • Abstract:
    <subdoc-abstract>
  • Title:
    <title-of-invention>
  • Document Date:
    <document-date>20021003</document-date>  
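As a sanity check on the tags above, here is a minimal sketch of pulling those fields out of a single application record. The embedded XML is a made-up stand-in (the nesting is a guess, not copied from the real file), and the stdlib ElementTree is used so the snippet stands alone; the same lookups work with BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Stand-in for one application record, using the tag names listed above;
# the wrapper elements are assumptions, not taken from the real file
xml = """<patent-application-publication>
  <subdoc-bibliographic-information>
    <application-number><doc-number>10044899</doc-number></application-number>
    <document-date>20021003</document-date>
    <title-of-invention>Example invention</title-of-invention>
  </subdoc-bibliographic-information>
  <subdoc-abstract>An example abstract.</subdoc-abstract>
</patent-application-publication>"""

root = ET.fromstring(xml)

# findtext() returns the text of the first element matching the path
record = {
    "application_number": root.findtext(".//application-number/doc-number"),
    "date": root.findtext(".//document-date"),
    "title": root.findtext(".//title-of-invention"),
    "abstract": root.findtext(".//subdoc-abstract").strip(),
}
print(record)
```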

Tuesday, December 17, 2013

Changes 12/17/13

NSF:

  • Installed lxml (html and xml parser) for use with BeautifulSoup
  • Read google doc on the assignment
  • XML and dealing with large amounts of data aren't things I'm very familiar with, so there is definitely a learning curve.
  • I understand the basics of XML (it is not that different from HTML), but the way the NSF patents XML file is structured is hard to understand.
  • The tags containing data are formatted as <BXXX> and iterate up from <B100>.  I need to figure out which values correspond to the specific data fields we are looking for.
  • What I have found so far:
    • <B540> is the title
  • The main problem I am having is that I have no idea what some of the text fields are supposed to be, for example:
    <B511><PDAT>E21B 2510</PDAT></B511>
  • This line may not be important to what we are trying to find, but the problem is that I don't know how to tell which fields are which.
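One way to attack the which-field-is-which problem: parse a record, collect every tag whose name looks like B plus digits, and print each code next to a sample of its text, so the values can be matched up with the codes. The fragment below is made up but follows the same <Bxxx><PDAT> shape as the real file; BeautifulSoup's "xml" parser (backed by lxml) is used so the uppercase tag names are preserved.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment in the same <Bxxx><PDAT> shape as the patent grant XML
xml = """<PATDOC>
<B110><PDAT>06655001</PDAT></B110>
<B511><PDAT>E21B 2510</PDAT></B511>
<B540><STEXT><PDAT>Example invention title</PDAT></STEXT></B540>
</PATDOC>"""

soup = BeautifulSoup(xml, "xml")  # "xml" parser keeps tag-name case

# find_all() accepts a compiled regex matched against tag names
samples = {}
for tag in soup.find_all(re.compile(r"^B\d+$")):
    samples[tag.name] = tag.get_text(" ", strip=True)

for name, text in sorted(samples.items()):
    print(name, "->", text)
```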

Monday, December 16, 2013

Changes 12/16/13

NSF:

  • I forgot my laptop today, so I experimented with BeautifulSoup's syntax on another computer.
  • The code I've written takes a downloaded version of a database search for "GOVT/"NATIONAL SCIENCE FOUNDATION"" and loads it into BeautifulSoup.  It then iterates through the <tbody> element with the statement for child in soup.table.tbody.children:
  • I still can't quite figure it out; the main problem I'm having is due to Python's duck typing. I cannot figure out how to properly check if an object is of a certain type.
  • The code in question is isinstance(child, NavigableString), and the error I get is something like "Type NavigableString is not valid". I could convert the type to a string and check if it equals "bs4.element.NavigableString", but that doesn't seem like a very elegant solution, and I want to learn how to do things "the right way"
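A clean way to do the check is to import the class itself and hand it to isinstance — NavigableString lives in bs4.element but is re-exported at the package top level. A small self-contained sketch (the table HTML is a stand-in for the real search-results page):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Stand-in for the results table; the newlines become NavigableStrings
html = "<table><tbody>\n<tr><td>20130333037</td></tr>\n</tbody></table>"
soup = BeautifulSoup(html, "html.parser")

kinds = []
for child in soup.table.tbody.children:
    if isinstance(child, NavigableString):  # bare text between tags
        kinds.append("string")
    elif isinstance(child, Tag):            # an actual element
        kinds.append("tag: " + child.name)

print(kinds)
```

Iterating over .children yields both Tag objects and the stray whitespace strings between them, which is why the type check matters.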

Friday, December 13, 2013

Changes 12/13/13

NSF:

  • Spent the whole day trying to get BeautifulSoup4 to work with Python 3, but it just won't work.
  • It says it will automatically convert, but it doesn't even install to the python3 folder, only the python2.7/dist-packages/ folder.  I tried manually copying the bs4 folder to the python3/dist-packages/ folder and running 2to3, but it still doesn't work.  It just comes up with syntax errors.
  • Well, I got it to work after over an hour.  The Python 2to3 docs word the functionality in a way that makes it seem like it will recursively iterate through directories, but apparently it doesn't.  I fixed the problem by manually running it for each directory in the bs4 folder.
  • I think by "recursive" it meant recursive over the files in the specific directory it was run in, not recursive down the directory tree.

Thursday, December 12, 2013

Changes 12/12/13

NSF:

  • Improved the python file to parse the actual html of the database search page
  • Running the program (user input follows each prompt, with my comments interleaved):
  • Insert your search terms: National Science Foundation
    Insert your tag: GOVT 
    After entering the search term and tag, the program outputs a useable url
    http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=0&p=1&f=S&l=50&Query=GOVT%2F%22National+Science+Foundation%22&d=PG01 
    It then parses the HTML from any <a> element in the <table> that holds all the data into a list 
    20130333037
    METHODS, SYSTEMS, AND MEDIA FOR DETECTING COVERT MALWARE\n
    20130332859
    METHOD AND USER INTERFACE FOR CREATING AN ANIMATED COMMUNICATION\n
    ...
    20130314948
    Multi-Phase Grid Interface\n
    20130314717
    METHODS AND APPARATUS FOR LASER SCANNING STRUCTURED ILLUMINATION\n     MICROSCOPY AND TOMOGRAPHY\n
     
  • The program makes use of Python's html.parser library
  • So far, it just outputs the name and number of each item in the dataset as a separate list item. Although the program doesn't do much yet, it demonstrates successfully parsing and displaying data gathered from the HTML page.
  • I'm not sure why it has odd formatting such as line breaks and white space, but I guess that is just part of the page parsing from the database.
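A stripped-down sketch of the parsing step described above: subclass HTMLParser from the html.parser library, flip a flag on <a> start tags, and collect the text in between. The embedded HTML is a stand-in for the results table, not the program's actual code.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects the text inside every <a> element."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>
        if self.in_link:
            self.links.append(data.strip())

# Stand-in for the results table on the search page
html = ('<table><tr><td><a href="/x">20130333037</a></td>'
        '<td><a href="/y">METHODS, SYSTEMS, AND MEDIA FOR '
        'DETECTING COVERT MALWARE</a></td></tr></table>')

parser = LinkTextParser()
parser.feed(html)
print(parser.links)
```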

Wednesday, December 11, 2013

Changes 12/11/13

Heading:

  • http://appft.uspto.gov/netahtml/PTO/search-adv.html has all the codes for searching specific fields of the applications 
  • Created a basic python program to generate a search url and get the html.
  • import urllib.request
    
    search = input("Insert your search terms: ")
    tag = input("Insert your tag: ")
    urlstring = "http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=0&p={{page}}&f=S&l=50&Query={{tag}}%2F%22{{query}}%22&d=PG01" 
    # Replace placeholder strings with actual input 
    urlstring = urlstring.replace('{{page}}', str(1)) 
    urlstring = urlstring.replace('{{tag}}', tag)
    urlstring = urlstring.replace('{{query}}', search.strip().replace(' ', '+'))
    print(urlstring)
     
    # Get raw html from the query url 
    response = urllib.request.urlopen(urlstring)
    html = response.read()
    print(html)
  • So entering "National Science Foundation" for search and "GOVT" for tag generates a query identical to the example in the email
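An alternative sketch of building the same URL, letting urllib.parse.urlencode handle the percent-escaping instead of hand-rolled placeholder replacement (the parameter names are taken from the query string above; the hand-rolled version works the same for simple input):

```python
import urllib.parse

base = "http://appft.uspto.gov/netacgi/nph-Parser"
params = {
    "Sect1": "PTO2", "Sect2": "HITOFF",
    "u": "/netahtml/PTO/search-adv.html",
    "r": "0", "p": "1", "f": "S", "l": "50",
    "Query": 'GOVT/"National Science Foundation"',
    "d": "PG01",
}
# urlencode percent-escapes each value and joins with & and =
urlstring = base + "?" + urllib.parse.urlencode(params)
print(urlstring)
```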

Friday, December 6, 2013

Changes 12/6/13

Math Drill:

  • Got all of the students functionality (adding, removing, and saving) working with SQL
  • Fixed an error with the add students regex causing spaces to be deleted.  It was '[\W-]+' when it should have been '[\W-]+ '
  • Added getStudents() convenience method for getting students from database. It executes the command SELECT * FROM students and returns all results
  • Added html page for /admin/ with links to admin functions, namely logging out and editing students
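A minimal sketch of what getStudents() could look like, assuming a students table with a single name column — the schema and the in-memory database here are guesses for illustration, not the app's real code:

```python
import sqlite3

# In-memory database with a guessed schema, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT)")
con.executemany("INSERT INTO students VALUES (?)", [("Alice",), ("Bob",)])

def getStudents():
    """Return every row from the students table."""
    return con.execute("SELECT * FROM students").fetchall()

print(getStudents())
```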

Thursday, December 5, 2013

Changes 12/5/13

Math Drill:

  • Worked on migrating the add and remove students functionality over to SQL
  • Adding students works, but removing students does not

Wednesday, December 4, 2013

Changes 12/4/13

Math Drill:

  • Added new tables to database: Students and Questions
  • These will replace the .txt files which held the student names and question/image pairs
  • Disallowed access to /admin/ pages without signing in first

NSF:

  • Explored databases
  • The Export Awards Excel spreadsheet looks like it shows various information about patents awarded due to NSF funding.  I assume these are the reports in which people properly cited NSF as a grant giver, since my understanding is that one of the main problems the NSF faces is finding patents that weren't filled out with proper NSF credit.  All the patents in the spreadsheet, however, have the same Funding Agency and Awarding Agency Code, 4900, which I'm guessing is the NSF's id.
  • If there is some way to view businesses attached to the patents, then compiling earnings based on these patents will be possible.
  • http://patents.reedtech.com/pgrbft.php#2012 looks like a repository of XML patent data
  • I downloaded one XML file and looked at it.  It was hard to figure out what it was showing because it had no styling, but it looked like a list of many different patents all in one file.  I couldn't tell what the number after ipg meant (the file I downloaded was named ipg131203.xml)
  • The StateObligations.xml and InstitutionObligations.xml files look like they list the amount of different types of "obligations" by state or institution.  I assume obligations are promises of grant money.

Changes for 12/2/13

Math Drill:

  • Changed the path attribute on the cookie to allow bottle to find it (before, it was only available on /login/submit/; now it is available throughout the site)
  • Added a list to store pairs of usernames and a corresponding random number used for the secret attribute of the cookie
  • Began troubleshooting rare cases of duplicate accounts in the list

Tuesday, December 3, 2013

Changes for 12/3/13

Math Drill:

  • Added checkAuth() method for checking the legitimacy of the cookie
  • When a user is already present in the accounts list and they sign in again, the old list item is removed
  • Added code to show the logged-in user's username on the admin/students/ page
  • Added ability for the user to log out
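A rough sketch of the scheme described in these entries — an accounts list of (username, random secret) pairs, a checkAuth() that verifies a cookie's pair against the list, replacement of the old entry on re-login, and logout. All names and details are guesses at what the real code does, and the bottle cookie plumbing is left out so the logic stands alone.

```python
import os

# List of (username, secret) pairs for currently signed-in users
accounts = []

def logIn(username):
    """Remove any old entry for this user, then issue a fresh secret."""
    global accounts
    accounts = [(u, s) for (u, s) in accounts if u != username]
    secret = os.urandom(16).hex()  # random value for the cookie's secret
    accounts.append((username, secret))
    return secret  # would be sent back in the session cookie

def checkAuth(username, secret):
    """Check the legitimacy of the cookie's (username, secret) pair."""
    return (username, secret) in accounts

def logOut(username):
    """Drop the user's entry so the old cookie stops working."""
    global accounts
    accounts = [(u, s) for (u, s) in accounts if u != username]

token = logIn("testuser")
print(checkAuth("testuser", token))    # True
print(checkAuth("testuser", "bogus"))  # False
```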

Changes over Thanksgiving Break 11/27/13 - 12/2/13

Math Drill:

  • Added database for storing users and passwords
  • Database stores sha256 salted hashes of passwords
  • Tried some simple SQL injection attacks on the username field.  The Python code interfacing with SQL was: cur.execute('SELECT password FROM users WHERE id = \"' + username + '\"').fetchone()
  • First attack: a"; INSERT INTO users VALUES ("injection", "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb");" was typed into the username field
  • This results in the following line being sent to sqlite, where bold text is the injected SQL:
    SELECT password FROM users WHERE id = "a"; INSERT INTO users VALUES ("injection", "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb");""
  • This gave an error: sqlite3.Warning: You can only execute one statement at a time.
  • The second attack doesn't create a new statement, but instead adds SQL logic to the end of the executed statement
  • The SQL code executed looks like: SELECT password FROM users WHERE id = "a" OR ""=""
  • Normally, the user would have to enter username(testuser), password(password) to log in. However, by using the combination username(a" OR ""="), password(password), one can sign in without knowing a valid username. The username string with injected SQL logic is equivalent to typing the username of the first user in the database, in this case the user "testuser".
  • Fixed injection by changing the execute statement to cur.execute('SELECT password FROM users WHERE id = ?', (username, )).fetchone()
    This protects against injection attacks
  • Sanitized input on the "add students" page by removing non-alphanumeric characters entered by the user
  • Started working on a session cookie
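The two behaviors can be demonstrated side by side against an in-memory sqlite3 database: with string concatenation the injected OR clause matches the first user, while the ? placeholder treats the whole input as a literal id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id TEXT, password TEXT)")
con.execute("INSERT INTO users VALUES ('testuser', 'somehash')")

username = 'a" OR ""="'

# Vulnerable: the injected logic becomes part of the SQL statement,
# so id = "a" OR ""="" is true for every row
row = con.execute('SELECT password FROM users WHERE id = "' + username + '"').fetchone()
print(row)  # ('somehash',) -- matches testuser even though the id is wrong

# Fixed: the ? placeholder keeps the input as a plain value
row = con.execute('SELECT password FROM users WHERE id = ?', (username,)).fetchone()
print(row)  # None -- no user is literally named 'a" OR ""="'
```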

Website:

  • Changed CSS a bit:
  • Made use of border-bottom and border-top properties instead of using custom <hr> elements
  • Added hover properties to links in the header