The Patent Name-Matching Project

 

State of play:

 

1) The USPTO/TAF-generated standardized assignee names and numbers as of 2004 have been run through the name standardization programs, producing namematch\coname04_std.zip.

2) A master list of all Compustat company names on the annual industrial files (including historical) as of 2005 has been run through the name standardization programs, producing namematch\cshdr05_std.zip. All files are in Stata 9 (SE) format; use Stat/Transfer if you need to reformat them.

 

These two files have been merged by stem_name with the following results (in the file coname04_cshdr05_match):

 

212,566 on coname04 not matched

17,483 on cshdr05 not matched

8,337 matched
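
The three counts above are what an outer merge on the stem name reports. The sketch below illustrates the idea in Python rather than in the Stata do files themselves. The names in it are hypothetical, and it counts distinct stem names, which is a simplification: the real files may carry several entries per stem.

```python
def merge_by_stem(patent_stems, compustat_stems):
    """Split distinct stem names into patent-only, Compustat-only, and matched."""
    left = set(patent_stems)
    right = set(compustat_stems)
    return {
        "patent_only": sorted(left - right),      # on coname04, not matched
        "compustat_only": sorted(right - left),   # on cshdr05, not matched
        "matched": sorted(left & right),          # on both lists
    }


# Hypothetical stem names, for illustration only.
patent_stems = ["IBM", "GENERAL ELECTRIC", "UNIV CALIFORNIA", "SIEMENS"]
compustat_stems = ["IBM", "GENERAL ELECTRIC", "WALMART"]

result = merge_by_stem(patent_stems, compustat_stems)
counts = {key: len(names) for key, names in result.items()}
print(counts)
```

Because the match is exact on the stem, any spelling difference that survives standardization leaves a pair unmatched, which is why the quality of the stem names drives the match rate.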

 

Of the 322 entries with more than 1000 patents between 1967 and 2004, the following is the breakdown:

 

162 matched to cshdr05

30 had successor firms on cshdr05

6 universities

10 government agencies

114 foreign firms (some of whom will match to CS via ADRs, etc.)

 

The current name standardization do files are given below:

 

main_coname04                  main program for coname04

names_main_compustat           main program for Compustat headers

uspto_code                     special coding for USPTO file

punctuation                    remove punctuation and clean special chars

standard_name                  standardize names, calling Derwent’s recodes

derwent_standardization_BHH    some standardization of names from Derwent’s doc

corporates                     search names for corporate names, create asstype

non_corporates                 search names for non-corporate names (indiv, univ, hosp, inst, govt)

stem_name                      create stem names, with corporate form etc. stripped
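
As a rough illustration of how the punctuation, standardization, and stem steps listed above fit together, here is a minimal Python sketch. It is not a translation of the do files: the recode table and the list of corporate forms are short hypothetical stand-ins, and the real rules are far more extensive.

```python
import re

# Hypothetical recodes; the actual lists in the do files are much longer.
STANDARD_FORMS = {
    "CORPORATION": "CORP",
    "INCORPORATED": "INC",
    "LIMITED": "LTD",
    "COMPANY": "CO",
}
CORPORATE_FORMS = {"CORP", "INC", "LTD", "CO", "LLC", "AG", "GMBH", "SA"}


def clean_punctuation(name):
    """Upper-case, spell out '&', drop periods/apostrophes, blank other punctuation."""
    name = name.upper().replace("&", " AND ")
    name = re.sub(r"[.']", "", name)          # A.G. -> AG
    name = re.sub(r"[^A-Z0-9 ]", " ", name)   # remaining punctuation -> space
    return " ".join(name.split())             # collapse repeated spaces


def standardize(name):
    """Replace each word by its standard abbreviation when one is defined."""
    return " ".join(STANDARD_FORMS.get(word, word) for word in name.split())


def stem(name):
    """Strip trailing corporate-form words, always keeping at least one word."""
    words = name.split()
    while len(words) > 1 and words[-1] in CORPORATE_FORMS:
        words.pop()
    return " ".join(words)


def stem_name(raw):
    """Full pipeline: punctuation, then standardization, then stemming."""
    return stem(standardize(clean_punctuation(raw)))


for raw in ["International Business Machines Corporation",
            "Procter & Gamble Co., Inc.",
            "Siemens A.G."]:
    print(raw, "->", stem_name(raw))
```

Standardizing before stemming means only the abbreviated corporate forms need to be listed for stripping, which keeps the two tables small and independent.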

 

Bronwyn H. Hall 5 December 2006

 

Stata code now exists that further standardizes the names on the 2002 USPTO file of standardized assignee names and classifies them into 6 types: firm, university, government, non-profit institution, hospital, or individual. This code still needs some work, but I will post it soon so others can look at it. Most of the recodes are based on work by Van Looy et al. at KU Leuven and MaCarthy at IFS London. The next step is to apply it to the Compustat firms and then try for a match.
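
To make the six-way classification concrete, the following Python sketch shows one way a keyword-based classifier of this kind could work. It is an assumption-laden illustration, not the posted Stata code: the keyword patterns are hypothetical, the first matching rule wins, and treating every unmatched name as an individual is a simplification of this sketch.

```python
import re

# Hypothetical keyword patterns; rules are tried in order, first match wins.
RULES = [
    ("university", r"\b(UNIV|COLLEGE|INSTITUTE OF TECHNOLOGY)"),
    ("hospital", r"\b(HOSPITAL|CLINIC|MEDICAL CENTER)\b"),
    ("government", r"\b(UNITED STATES OF AMERICA|DEPARTMENT OF|MINISTRY|SECRETARY OF)\b"),
    ("non-profit institution", r"\b(INSTITUTE|FOUNDATION|RESEARCH COUNCIL)\b"),
    ("firm", r"\b(CORP|INC|LTD|GMBH|CO|AG|SA)\b"),
]


def classify(name):
    """Return the assignee type of a name that has already been standardized."""
    name = name.upper()
    for asstype, pattern in RULES:
        if re.search(pattern, name):
            return asstype
    # Assumption of this sketch: names matching no rule are individuals.
    return "individual"


for example in ["STANFORD UNIVERSITY",
                "MASSACHUSETTS GENERAL HOSPITAL",
                "BATTELLE MEMORIAL INSTITUTE",
                "GENERAL ELECTRIC CO",
                "SMITH JOHN A"]:
    print(example, "->", classify(example))
```

Rule order matters: the university rule is tried before the non-profit rule so that a name containing INSTITUTE OF TECHNOLOGY is not picked up by the more general INSTITUTE keyword.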

 

A related but separable project is the acquisition of a recent time series of ownership structure of firms around the world (yes, we think big). Anyone who has advice on this or some data, please let me and/or Megan MacGarvie know.

 

Bronwyn H. Hall 3 September 2006

 

Thanks to Jim Bessen, I now have a copy of the USPTO/TAF-generated standardized assignee names and numbers as of 2002, and have updated the pat63_02 file to identify the patent assignee for each assigned patent using the assignee numbers in this file. This makes the discussion below obsolete, as this assignee list and the 1963-2002 file cover all the patents (3.4 million) for the period and we no longer have to match the older NBER data to the 2002 file.

 

The project now becomes the problem of matching these names to the Compustat firm names. More later on this.

 

Bronwyn H. Hall 4 December 2004

 

As many people already know, I have updated the NBER patent data to include patents issued through 2002, and the merged dataset is available on this website as pat63_02b.zip. The problem is that these data do not contain the names of the patent assignees, because the names have not yet been standardized and matched to the 1963-1999 master name list stored at NBER (see www.nber.org/patents and read the information about coname). For this reason I have started a project to standardize and match the names. The state of play is stored on this website in the hope that others may wish to join in and contribute their name standardization efforts. Here you will find the following:

 

A file with the ~220K organization names for the patenting organizations. For the 2000-2002 patents (whose assignee names are given in pat00_02ass.zip) these organization names will be used to match a standardized name back to the patents themselves. The complete list of ~175K 1963-1999 names is included in the file because these names can be matched to the names from 2000-2002, thus providing continuity. In fact, almost 26K have already been matched, accounting for about 80 per cent of the patents issued during this period. Documentation for this file is in assnames2_desc.doc.

 

Some Stata do files that have done the cleaning to date. Of particular interest is cleaname.do, which defines a number of the rules I used.

 

Finally I have included a preliminary memo on the name-matching problem in these data. It is incomplete, especially with respect to the Compustat match, as I have not yet had time to pursue this problem.

 

Bronwyn H. Hall  1 November 2004