Tuesday, December 17, 2013

Changes 12/17/13

NSF:

  • Installed lxml (an HTML and XML parser) for use with BeautifulSoup
  • Read the Google Doc on the assignment
  • I am not very familiar with XML or with handling large amounts of data, so there is definitely a learning curve.
  • I understand the basics of XML (it is not that different from HTML), but the structure of the NSF patents XML file is hard to follow.
  • The tags containing data are named <BXXX>, with the number counting up from <B100>. I need to figure out which tags correspond to the specific data fields we are looking for.
  • What I have found so far:
    • <B540> is the title
  • The main problem I am having is that I have no idea what some of the text fields are supposed to be, for example:
    <B511><PDAT>E21B 2510</PDAT></B511>
  • This line may not matter for what we are trying to find, but I still have no way of telling which field is which.
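
One way to start telling the fields apart is to survey every <BXXX> tag in the file and look at a sample of the text under each one. The sketch below is only an illustration of that idea, not the approach used for the assignment: it swaps in Python's standard-library ElementTree in place of BeautifulSoup and lxml so that it runs without extra installs, the SAMPLE document and the survey_b_tags helper are hypothetical, and it assumes the input is a single well-formed XML document, which may not hold for the real patents file.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample shaped like the line quoted above; the <PATDOC> and
# <B500> wrappers and the title text are made up for illustration.
SAMPLE = """<PATDOC>
  <B500>
    <B511><PDAT>E21B 2510</PDAT></B511>
    <B540><PDAT>Example title</PDAT></B540>
  </B500>
</PATDOC>"""

# Matches tag names of the form B followed by three digits, e.g. B100, B540.
B_TAG = re.compile(r"^B\d{3}$")


def survey_b_tags(source):
    """Count each <BXXX> tag and keep one sample of the text found under it.

    `source` is a filename or a binary file-like object holding XML.
    """
    counts = Counter()
    examples = {}
    # "end" events fire once an element and all of its children are parsed,
    # so the full text under the element is available at that point.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if not B_TAG.match(elem.tag):
            continue
        counts[elem.tag] += 1
        # Gather all nested text and collapse runs of whitespace.
        text = " ".join("".join(elem.itertext()).split())
        if text and elem.tag not in examples:
            examples[elem.tag] = text[:60]  # truncate long samples
    return counts, examples


if __name__ == "__main__":
    counts, examples = survey_b_tags(io.BytesIO(SAMPLE.encode("utf-8")))
    for tag in sorted(counts):
        print(tag, counts[tag], repr(examples.get(tag, "")))
```

Reading the samples next to each tag name should make it easier to guess what a field holds, or at least to see which tags appear in every record and which are rare. Note that this version keeps the whole parsed tree in memory, so a very large file would need elements to be released as each record is finished.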

1 comment:

  1. I have a suggestion for you. Register for this free online introduction-to-databases course from Stanford University:

    https://class.stanford.edu/courses/Engineering/db/2014_1/about

    It is a very well put together course, using the same materials that Stanford students use.
