Programming Stuff: Changes 12/17/13

Tuesday, December 17, 2013

Installed lxml (an HTML and XML parser) for use with BeautifulSoup.
Read the Google Doc on the assignment.
XML and dealing with large amounts of data aren't things I am super familiar with, so there is definitely a learning curve.
I understand the basics of XML (it is not that different from HTML), but the way the NSF patents XML file is structured is hard to understand.
The tags containing data are named in the form <BXXX>, where XXX is a number, and the numbers count up from <B100>. I need to figure out which tags correspond to the specific data fields we are looking for.
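One way to start matching tags to fields is to list every <BXXX> tag in a record along with a sample of its text. Here is a rough sketch of that idea. It uses Python's built-in xml.etree.ElementTree instead of BeautifulSoup just to keep it dependency-free, and the `survey_b_tags` name and the `SAMPLE` record are made up for illustration (only the B511 line comes from the real file). The real file may need different handling, for example if it is too big to load in one go.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical record for illustration only: the <record> wrapper, the nesting,
# and the B100/B110 placeholder are assumptions, not the real file layout.
SAMPLE = """<record>
  <B100><B110><PDAT>placeholder</PDAT></B110></B100>
  <B500><B510><B511><PDAT>E21B 2510</PDAT></B511></B510></B500>
</record>"""


def survey_b_tags(xml_text):
    """Return {tag: (count, first_text_seen)} for tags named B + digits."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    first_text = {}
    for elem in root.iter():
        if re.fullmatch(r"B\d+", elem.tag):
            counts[elem.tag] += 1
            # Gather all text under the element and collapse whitespace.
            text = " ".join("".join(elem.itertext()).split())
            first_text.setdefault(elem.tag, text)
    return {tag: (counts[tag], first_text[tag]) for tag in sorted(counts)}


for tag, (count, sample) in survey_b_tags(SAMPLE).items():
    print(tag, count, repr(sample))
```

Seeing each tag next to a real value makes it easier to guess what the tag holds, and tags that only wrap other tags show up with the same text as their children.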
What I have found so far:

The main problem I am having is that I have no idea what some of the text fields are supposed to be, for example:
```
<B511><PDAT>E21B 2510</PDAT></B511>
```
This line may not be important to what we are trying to find, but the underlying problem is that I don't know how to tell which field is which.
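Something that might help with a mystery tag like <B511> is looking at which tags it sits inside, since the parent tags can hint at what group of fields it belongs to. A minimal sketch, again with the standard library's xml.etree.ElementTree rather than BeautifulSoup. The `tag_paths` helper and the nesting in `SAMPLE` are assumptions for illustration, not something confirmed from the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical nesting for illustration; only the B511 line is from the real file.
SAMPLE = """<record>
  <B500><B510><B511><PDAT>E21B 2510</PDAT></B511></B510></B500>
</record>"""


def tag_paths(xml_text, wanted):
    """Return the root-to-element tag path for every element named `wanted`."""
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    paths = []
    for elem in root.iter(wanted):
        chain = [elem.tag]
        node = elem
        while node in parent_of:
            node = parent_of[node]
            chain.append(node.tag)
        paths.append("/".join(reversed(chain)))
    return paths


print(tag_paths(SAMPLE, "B511"))
```

If the same tag always shows up under the same parents, that grouping is probably a good clue for what kind of field it is.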