Thursday, December 19, 2013

Changes 12/19/13

NSF:

  • (Incomplete) list of fields and their XML tags:
  • Application Number:
    <application-number> 
    <doc-number>10044899</doc-number>
  • Cross Reference to Related Application:
    <cross-reference-to-related-applications>
     ...
  • Inventors:
    <inventors>
    <first-named-inventor>
    <inventor>
    ... 
  • Abstract:
    <subdoc-abstract>
  • Title:
    <title-of-invention>
  • Document Date:
    <document-date>20021003</document-date>  
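As a sanity check on the tags above, here is a minimal sketch of pulling those fields out of a single application record. The embedded XML is a made-up stand-in (the nesting is a guess, not copied from the real file), and the stdlib ElementTree is used so the snippet stands alone; the same lookups work with BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Stand-in for one application record, using the tag names listed above;
# the wrapper elements are assumptions, not taken from the real file
xml = """<patent-application-publication>
  <subdoc-bibliographic-information>
    <application-number><doc-number>10044899</doc-number></application-number>
    <document-date>20021003</document-date>
    <title-of-invention>Example invention</title-of-invention>
  </subdoc-bibliographic-information>
  <subdoc-abstract>An example abstract.</subdoc-abstract>
</patent-application-publication>"""

root = ET.fromstring(xml)

# findtext() returns the text of the first element matching the path
record = {
    "application_number": root.findtext(".//application-number/doc-number"),
    "date": root.findtext(".//document-date"),
    "title": root.findtext(".//title-of-invention"),
    "abstract": root.findtext(".//subdoc-abstract").strip(),
}
print(record)
```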

Tuesday, December 17, 2013

Changes 12/17/13

NSF:

  • Installed lxml (html and xml parser) for use with BeautifulSoup
  • Read google doc on the assignment
  • XML and dealing with large amounts of data aren't things I'm very familiar with, so there is definitely a learning curve.
  • I understand the basics of XML (it is not that different from HTML), but the way the NSF patents XML file is structured is hard to understand.
  • The tags containing data are formatted as <BXXX> and iterate up from <B100>.  I need to figure out which values correspond to the specific data fields we are looking for.
  • What I have found so far:
    • <B540> is the title
  • The main problem I am having is that I have no idea what some of the text fields are supposed to be, for example:
    <B511><PDAT>E21B 2510</PDAT></B511>
  • This line may not be important to what we are trying to find, but the problem is that I don't know how to tell which fields are which.
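One way to attack the which-field-is-which problem: parse a record, collect every tag whose name looks like B plus digits, and print each code next to a sample of its text, so the values can be matched up with the codes. The fragment below is made up but follows the same <Bxxx><PDAT> shape as the real file; BeautifulSoup's "xml" parser (backed by lxml) is used so the uppercase tag names are preserved.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment in the same <Bxxx><PDAT> shape as the patent grant XML
xml = """<PATDOC>
<B110><PDAT>06655001</PDAT></B110>
<B511><PDAT>E21B 2510</PDAT></B511>
<B540><STEXT><PDAT>Example invention title</PDAT></STEXT></B540>
</PATDOC>"""

soup = BeautifulSoup(xml, "xml")  # "xml" parser keeps tag-name case

# find_all() accepts a compiled regex matched against tag names
samples = {}
for tag in soup.find_all(re.compile(r"^B\d+$")):
    samples[tag.name] = tag.get_text(" ", strip=True)

for name, text in sorted(samples.items()):
    print(name, "->", text)
```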

Monday, December 16, 2013

Changes 12/16/13

NSF:

  • I forgot my laptop today, so I experimented with BeautifulSoup's syntax on another computer.
  • The code I've written takes a downloaded version of a database search for "GOVT/"NATIONAL SCIENCE FOUNDATION"" and loads it into BeautifulSoup.  It then iterates through the <tbody> element with the statement for child in soup.table.tbody.children:
  • I still can't quite figure it out; the main problem I'm having is due to Python's duck typing. I cannot figure out how to properly check if an object is of a certain type.
  • The code in question is isinstance(child, NavigableString), and the error I get is something like "Type NavigableString is not valid". I could convert the type to a string and check if it equals "bs4.element.NavigableString", but that doesn't seem like a very elegant solution, and I want to learn how to do things "the right way"
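A clean way to do the check is to import the class itself and hand it to isinstance — NavigableString lives in bs4.element but is re-exported at the package top level. A small self-contained sketch (the table HTML is a stand-in for the real search-results page):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Stand-in for the results table; the newlines become NavigableStrings
html = "<table><tbody>\n<tr><td>20130333037</td></tr>\n</tbody></table>"
soup = BeautifulSoup(html, "html.parser")

kinds = []
for child in soup.table.tbody.children:
    if isinstance(child, NavigableString):  # bare text between tags
        kinds.append("string")
    elif isinstance(child, Tag):            # an actual element
        kinds.append("tag: " + child.name)

print(kinds)
```

Iterating over .children yields both Tag objects and the stray whitespace strings between them, which is why the type check matters.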

Friday, December 13, 2013

Changes 12/13/13

NSF:

  • Spent the whole day trying to get BeautifulSoup4 to work with Python 3, but it just won't work.
  • It says it will automatically convert, but it doesn't even install to the python3 folder, only the python2.7/dist-packages/ folder.  I tried manually copying the bs4 folder to the python3/dist-packages/ folder and running 2to3, but it still doesn't work.  It just comes up with syntax errors.
  • Well, I got it to work after over an hour.  The Python 2to3 docs word the functionality in a way that makes it seem like it will recursively iterate through directories, but apparently it doesn't.  I fixed the problem by manually running it for each directory in the bs4 folder.
  • I think by "recursive" it meant recursive over the files in the specific directory it was run in, not recursive down the directory tree.

Thursday, December 12, 2013

Changes 12/12/13

NSF:

  • Improved the python file to parse the actual html of the database search page
  • Running the program (user input follows each prompt, with my comments interleaved):
  • Insert your search terms: National Science Foundation
    Insert your tag: GOVT 
    After entering the search term and tag, the program outputs a useable url
    http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=0&p=1&f=S&l=50&Query=GOVT%2F%22National+Science+Foundation%22&d=PG01 
    It then parses the HTML from any <a> element in the <table> that holds all the data into a list 
    20130333037
    METHODS, SYSTEMS, AND MEDIA FOR DETECTING COVERT MALWARE\n
    20130332859
    METHOD AND USER INTERFACE FOR CREATING AN ANIMATED COMMUNICATION\n
    ...
    20130314948
    Multi-Phase Grid Interface\n
    20130314717
    METHODS AND APPARATUS FOR LASER SCANNING STRUCTURED ILLUMINATION\n     MICROSCOPY AND TOMOGRAPHY\n
     
  • The program makes use of Python's html.parser library
  • So far, it just outputs the name and number of each item in the dataset as a separate list item. Although the program doesn't do much yet, it demonstrates successfully parsing and displaying data gathered from the HTML page.
  • I'm not sure why it has odd formatting such as line breaks and white space, but I guess that is just part of the page parsing from the database.
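A stripped-down sketch of the parsing step described above: subclass HTMLParser from the html.parser library, flip a flag on <a> start tags, and collect the text in between. The embedded HTML is a stand-in for the results table, not the program's actual code.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects the text inside every <a> element."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>
        if self.in_link:
            self.links.append(data.strip())

# Stand-in for the results table on the search page
html = ('<table><tr><td><a href="/x">20130333037</a></td>'
        '<td><a href="/y">METHODS, SYSTEMS, AND MEDIA FOR '
        'DETECTING COVERT MALWARE</a></td></tr></table>')

parser = LinkTextParser()
parser.feed(html)
print(parser.links)
```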

Wednesday, December 11, 2013

Changes 12/11/13

Heading:

  • http://appft.uspto.gov/netahtml/PTO/search-adv.html has all the codes for searching specific fields of the applications 
  • Created a basic python program to generate a search url and get the html.
  • import urllib.request
    
    search = input("Insert your search terms: ")
    tag = input("Insert your tag: ")
    urlstring = "http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.html&r=0&p={{page}}&f=S&l=50&Query={{tag}}%2F%22{{query}}%22&d=PG01" 
    # Replace placeholder strings with actual input 
    urlstring = urlstring.replace('{{page}}', str(1)) 
    urlstring = urlstring.replace('{{tag}}', tag)
    urlstring = urlstring.replace('{{query}}', search.strip().replace(' ', '+'))
    print(urlstring)
     
    # Get raw html from the query url 
    response = urllib.request.urlopen(urlstring)
    html = response.read()
    print(html)
  • So entering "National Science Foundation" for search and "GOVT" for tag generates a query identical to the example in the email
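An alternative sketch of building the same URL, letting urllib.parse.urlencode handle the percent-escaping instead of hand-rolled placeholder replacement (the parameter names are taken from the query string above; the hand-rolled version works the same for simple input):

```python
import urllib.parse

base = "http://appft.uspto.gov/netacgi/nph-Parser"
params = {
    "Sect1": "PTO2", "Sect2": "HITOFF",
    "u": "/netahtml/PTO/search-adv.html",
    "r": "0", "p": "1", "f": "S", "l": "50",
    "Query": 'GOVT/"National Science Foundation"',
    "d": "PG01",
}
# urlencode percent-escapes each value and joins with & and =
urlstring = base + "?" + urllib.parse.urlencode(params)
print(urlstring)
```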

Friday, December 6, 2013

Changes 12/6/13

Math Drill:

  • Got all of the students functionality (adding, removing, and saving) working with SQL
  • Fixed an error with the add students regex causing spaces to be deleted.  It was '[\W-]+' when it should have been '[\W-]+ '
  • Added getStudents() convenience method for getting students from database. It executes the command SELECT * FROM students and returns all results
  • Added html page for /admin/ with links to admin functions, namely logging out and editing students
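A minimal sketch of what getStudents() could look like, assuming a students table with a single name column — the schema and the in-memory database here are guesses for illustration, not the app's real code:

```python
import sqlite3

# In-memory database with a guessed schema, for illustration only
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT)")
con.executemany("INSERT INTO students VALUES (?)", [("Alice",), ("Bob",)])

def getStudents():
    """Return every row from the students table."""
    return con.execute("SELECT * FROM students").fetchall()

print(getStudents())
```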

Thursday, December 5, 2013

Changes 12/5/13

Math Drill:

  • Worked on migrating the add and remove students functionality over to SQL
  • Adding students works, but removing students does not

Wednesday, December 4, 2013

Changes 12/4/13

Math Drill:

  • Added new tables to database: Students and Questions
  • These will replace the .txt files which held the student names and question/image pairs
  • Disallowed access to /admin/ pages without signing in first

NSF:

  • Explored databases
  • The Export Awards Excel spreadsheet looks like it shows various information about patents awarded due to NSF funding.  I assume these are the reports in which people properly cited NSF as a grant giver, since my understanding is that one of the main problems the NSF faces is finding patents that weren't filled out with proper NSF credit.  All the patents in the spreadsheet, however, have the same Funding Agency and Awarding Agency Code, 4900, which I'm guessing is the NSF's id.
  • If there is some way to view businesses attached to the patents, then compiling earnings based on these patents will be possible.
  • http://patents.reedtech.com/pgrbft.php#2012 looks like a repository of XML patent data
  • I downloaded one XML file and looked at it.  It was hard to figure out what it was showing because it had no styling, but it looked like a list of many different patents all in one file.  I couldn't tell what the number after ipg meant (the file I downloaded was named ipg131203.xml)
  • The StateObligations.xml and InstitutionObligations.xml files look like they list the amount of different types of "obligations" by state or institution.  I assume obligations are promises of grant money.

Changes for 12/2/13

Math Drill:

  • Changed the path attribute on the cookie to allow bottle to find it (before, it was only available on /login/submit/; now it is available throughout the site)
  • Added a list to store pairs of usernames and a corresponding random number used for the secret attribute of the cookie
  • Began troubleshooting rare cases of duplicate accounts in the list

Tuesday, December 3, 2013

Changes for 12/3/13

Math Drill:

  • Added checkAuth() method for checking the legitimacy of the cookie
  • When a user is already present in the accounts list and they sign in again, the old list item is removed
  • Added code to show the logged-in user's username on the admin/students/ page
  • Added ability for the user to log out
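A rough sketch of the scheme described in these entries — an accounts list of (username, random secret) pairs, a checkAuth() that verifies a cookie's pair against the list, replacement of the old entry on re-login, and logout. All names and details are guesses at what the real code does, and the bottle cookie plumbing is left out so the logic stands alone.

```python
import os

# List of (username, secret) pairs for currently signed-in users
accounts = []

def logIn(username):
    """Remove any old entry for this user, then issue a fresh secret."""
    global accounts
    accounts = [(u, s) for (u, s) in accounts if u != username]
    secret = os.urandom(16).hex()  # random value for the cookie's secret
    accounts.append((username, secret))
    return secret  # would be sent back in the session cookie

def checkAuth(username, secret):
    """Check the legitimacy of the cookie's (username, secret) pair."""
    return (username, secret) in accounts

def logOut(username):
    """Drop the user's entry so the old cookie stops working."""
    global accounts
    accounts = [(u, s) for (u, s) in accounts if u != username]

token = logIn("testuser")
print(checkAuth("testuser", token))    # True
print(checkAuth("testuser", "bogus"))  # False
```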

Changes over Thanksgiving Break 11/27/13 - 12/2/13

Math Drill:

  • Added database for storing users and passwords
  • Database stores sha256 salted hashes of passwords
  • Tried some simple SQL injection attacks on the username field.  The Python code interfacing with SQL was: cur.execute('SELECT password FROM users WHERE id = \"' + username + '\"').fetchone()
  • First attack: a"; INSERT INTO users VALUES ("injection", "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb");" was typed into the username field
  • This results in the following line being sent to sqlite, where bold text is the injected SQL:
    SELECT password FROM users WHERE id = "a"; INSERT INTO users VALUES ("injection", "ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb");""
  • This gave an error: sqlite3.Warning: You can only execute one statement at a time.
  • The second attack doesn't create a new statement, but instead adds SQL logic to the end of the executed statement
  • The SQL code executed looks like: SELECT password FROM users WHERE id = "a" OR ""=""
  • Normally, the user would have to enter username(testuser), password(password) to log in. However, by using the combination username(a" OR ""="), password(password), one can sign in without knowing a valid username. The username string with injected SQL logic is equivalent to typing the username of the first user in the database, in this case the user "testuser".
  • Fixed injection by changing the execute statement to cur.execute('SELECT password FROM users WHERE id = ?', (username, )).fetchone()
    This protects against injection attacks
  • Sanitized input on the "add students" page by removing non-alphanumeric characters entered by the user
  • Started working on a session cookie
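The two behaviors can be demonstrated side by side against an in-memory sqlite3 database: with string concatenation the injected OR clause matches the first user, while the ? placeholder treats the whole input as a literal id.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id TEXT, password TEXT)")
con.execute("INSERT INTO users VALUES ('testuser', 'somehash')")

username = 'a" OR ""="'

# Vulnerable: the injected logic becomes part of the SQL statement,
# so id = "a" OR ""="" is true for every row
row = con.execute('SELECT password FROM users WHERE id = "' + username + '"').fetchone()
print(row)  # ('somehash',) -- matches testuser even though the id is wrong

# Fixed: the ? placeholder keeps the input as a plain value
row = con.execute('SELECT password FROM users WHERE id = ?', (username,)).fetchone()
print(row)  # None -- no user is literally named 'a" OR ""="'
```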

Website:

  • Changed CSS a bit:
  • Made use of border-bottom and border-top properties instead of using custom <hr> elements
  • Added hover properties to links in the header