NSF:
- Installed lxml (an HTML and XML parser) for use with BeautifulSoup
- Read the Google Doc on the assignment
- XML and working with large amounts of data are not things I am very familiar with, so there is definitely a learning curve.
- I understand the basics of XML (it is not that different from HTML), but the structure of the NSF patents XML file is hard to understand.
- The tags containing data are named <BXXX> and count up from <B100>. I need to figure out which tags correspond to the specific data fields we are looking for.
- What I have found so far:
- <B540> is the title
- The main problem I am having is that I have no idea what some of the text fields are supposed to be, for example: <B511><PDAT>E21B 2510</PDAT></B511>
- This particular line may not matter for what we are trying to find, but the broader problem is that I don't know how to tell which fields are which.
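One way to start working out which fields are which is to list every leaf <BXXX> tag next to a few example values, so the contents can be eyeballed tag by tag. This is only a minimal sketch: it uses Python's standard-library xml.etree.ElementTree instead of BeautifulSoup so it runs without extra installs, and the SAMPLE record is hypothetical. Only the <B511> line is taken from the notes above; the wrapper tags, the nesting, and the title text are made up, and the real file may be laid out differently.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature record. Only the <B511> line comes from the notes;
# <RECORD>, the <B500> nesting, and the title text are invented for illustration.
SAMPLE = """
<RECORD>
  <B500>
    <B511><PDAT>E21B 2510</PDAT></B511>
    <B540><PDAT>Example title</PDAT></B540>
  </B500>
</RECORD>
""".strip()

# Matches tag names such as B100, B511, B540.
B_TAG = re.compile(r"^B\d+$")


def sample_values(xml_text, max_samples=3):
    """Map each leaf <BXXX> tag to a few example text values found under it."""
    root = ET.fromstring(xml_text)
    samples = defaultdict(list)
    for elem in root.iter():
        if not B_TAG.match(elem.tag):
            continue
        # Skip containers that only group other <BXXX> tags.
        if any(B_TAG.match(child.tag) for child in elem):
            continue
        # Collect all nested text (e.g. inside <PDAT>) and normalize whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text and len(samples[elem.tag]) < max_samples:
            samples[elem.tag].append(text)
    return dict(samples)


for tag, values in sorted(sample_values(SAMPLE).items()):
    print(tag, values)
```

Running the same idea over a handful of real records should give a short table of tag names and sample values, which is usually enough to guess that a tag holds a title, a date, or a classification code before checking it against whatever documentation exists for the format.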
I have a suggestion for you. Register for the following free online introductory database course from Stanford University:
https://class.stanford.edu/courses/Engineering/db/2014_1/about
It is a very well put together course, using the same materials that Stanford students use.