The Patent Name-Matching Project
State of play:

1) The USPTO/TAF-generated standardized assignee names and numbers as of 2004 have been run through the name standardization programs, producing namematch\coname04_std.zip.

2) A master list of all Compustat company names on the annual industrial files (including historical) as of 2005 has been run through the name standardization programs, producing namematch\cshdr05_std.zip.

All files are in Stata 9 (SE) format; use Stat/Transfer if you need to reformat them.
These two files have been merged by stem_name with the following results (in the file coname04_cshdr05_match):

212,566 on coname04 not matched
17,483 on cshdr05 not matched
8,337 matched
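
To make the three counts above concrete, here is a minimal Python sketch of a merge on a stem name that tallies the same three outcomes. It is not the Stata code used for the project: the function name, the row layout, and the sample names are all hypothetical, and it assumes that "matched" counts rows on the patent side whose stem also appears on the Compustat side.

```python
from collections import Counter

def merge_by_stem(patent_rows, compustat_rows):
    """Tally the outcomes of merging two name lists on their stem name.

    Each row is a (stem_name, identifier) pair. This is an illustrative
    sketch; the counting convention for "matched" is an assumption.
    """
    patent_stems = {stem for stem, _ in patent_rows}
    compustat_stems = {stem for stem, _ in compustat_rows}
    tally = Counter()
    for stem, _ in patent_rows:
        if stem in compustat_stems:
            tally["matched"] += 1
        else:
            tally["patent side not matched"] += 1
    for stem, _ in compustat_rows:
        if stem not in patent_stems:
            tally["compustat side not matched"] += 1
    return tally

# Hypothetical sample rows, for illustration only
patent_rows = [("INTL BUSINESS MACHINES", 1), ("ACME WIDGET", 2), ("STANFORD UNIV", 3)]
compustat_rows = [("INTL BUSINESS MACHINES", "A"), ("GENERAL WIDGETS", "B")]

for outcome, count in sorted(merge_by_stem(patent_rows, compustat_rows).items()):
    print(count, outcome)
```

With real data, a stem that appears more than once on either side would need a decision about which record wins; this sketch does not address that.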
Of the 322 entries with more than 1,000 patents between 1967 and 2004, the breakdown is as follows:

| 162 | matched to cshdr05 |
| 30 | had successor firms on cshdr05 |
| 6 | universities |
| 10 | government agencies |
| 114 | foreign firms (some of which will match to Compustat via ADRs, etc.) |
The current name standardization do files are given below:

| main program for coname04 |
| main program for Compustat headers |
| special coding for USPTO file |
| remove punctuation and clean special chars |
| standardize names, calling Derwent’s recodes |
| some standardization of names from Derwent’s doc |
| search names for corporate names, create asstype |
| search names for non-corporate names (indiv, univ, hosp, inst, govt) |
| create stem names, with corporate form etc. stripped |
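
The cleaning, standardizing, and stemming steps listed above can be sketched in a few lines of Python. This is only an illustration of the general idea, not a translation of the do files: the function names, the recode table, and the list of corporate forms are hypothetical and far shorter than anything needed for real data.

```python
import re

# Hypothetical recodes and corporate forms, for illustration only
RECODES = {"CORPORATION": "CORP", "INCORPORATED": "INC",
           "LIMITED": "LTD", "COMPANY": "CO"}
CORPORATE_FORMS = {"CORP", "INC", "LTD", "CO", "LLC", "AG", "GMBH", "SA", "PLC"}

def clean_punctuation(name):
    """Upper-case the name, drop periods and apostrophes, blank other punctuation."""
    name = name.upper().replace(".", "").replace("'", "")
    name = name.replace("&", " AND ")
    name = re.sub(r"[^A-Z0-9 ]", " ", name)
    return " ".join(name.split())  # collapse runs of spaces

def standardize(name):
    """Replace each word that has a recode with its standard abbreviation."""
    return " ".join(RECODES.get(word, word) for word in name.split())

def stem_name(std_name):
    """Strip trailing corporate-form words, always keeping at least one word."""
    words = std_name.split()
    while len(words) > 1 and words[-1] in CORPORATE_FORMS:
        words.pop()
    return " ".join(words)

def make_stem(raw_name):
    """Run the three steps in order: clean, standardize, stem."""
    return stem_name(standardize(clean_punctuation(raw_name)))

print(make_stem("Acme Widget Company, Inc."))
print(make_stem("ACME WIDGET CORP"))
```

Both hypothetical spellings reduce to the same stem, which is what allows two lists that spell a firm's name differently to be merged on the stem.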
Bronwyn H. Hall, 5 December 2006
Stata code now exists that further standardizes the names on the 2002 USPTO file of standardized assignee names and classifies them into 6 types: firm, university, government, non-profit institution, hospital, or individual. This code still needs some work, but I will post it soon so others can look at it. Most of the recodes are based on work by Van Looy et al. at KU Leuven and MaCarthy at IFS.
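
As a rough illustration of what classifying names into these six types can look like, here is a minimal Python sketch based on keyword search. It is not the Stata code described above: the keyword lists, the order in which types are checked, and the rule that a name with no keyword is treated as an individual are all assumptions made for the example.

```python
# Hypothetical keyword lists, checked in this order; real recodes are far longer
TYPE_KEYWORDS = [
    ("university", {"UNIV", "UNIVERSITY", "COLLEGE"}),
    ("hospital", {"HOSPITAL", "HOSP", "CLINIC"}),
    ("government", {"GOVERNMENT", "MINISTRY", "ARMY", "NAVY"}),
    ("institution", {"INSTITUTE", "INST", "FOUNDATION"}),
    ("firm", {"CORP", "INC", "LTD", "CO", "GMBH", "AG"}),
]

def classify(name):
    """Return the first type whose keyword appears as a whole word in the name.

    Assumption: names with no recognized keyword are treated as individuals.
    """
    words = set(name.upper().split())
    for assignee_type, keywords in TYPE_KEYWORDS:
        if words & keywords:
            return assignee_type
    return "individual"

for example in ["STANFORD UNIV", "ACME WIDGET CORP", "JOHN Q SMITH"]:
    print(example, "->", classify(example))
```

Matching on whole words rather than substrings avoids, for instance, treating every name that contains the letters "CO" as a firm.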
A related but separable project is the acquisition of a recent time series of the ownership structure of firms around the world (yes, we think big ☺). Anyone who has advice on this or some data, please let me and/or Megan MacGarvie know.
Bronwyn H. Hall, 3 September 2006
Thanks to Jim Bessen, I now have a copy of the USPTO/TAF-generated standardized assignee names and numbers as of 2002, and have updated the pat63_02 file to identify the patent assignee for each assigned patent using the assignee numbers in this file. This makes the discussion below obsolete, as this assignee list and the 1963-2002 file cover all the patents (3.4 million) for the period and we no longer have to match the older NBER data to the 2002 file.

The project now becomes the problem of matching these names to the Compustat firm names. More later on this.
Bronwyn H. Hall, 4 December 2004
As many people already know, I have updated the NBER patent data to include patents issued through 2002, and the merged dataset is available on this website as pat63_02b.zip. The problem is that these data do not contain the names of the patent assignees, because the names have not yet been standardized and matched to the 1963-1999 master name list stored at NBER (see www.nber.org/patents and read the information about coname). For this reason I have started a project to standardize and match the names. The state of play is stored on this website in the hope that others may wish to join in and contribute their name standardization efforts. Here you will find the following:
A file with the ~220K organization names for the patenting organizations. For the 2000-2002 patents (whose assignee names are given in pat00_02ass.zip), these organization names will be used to match a standardized name back to the patents themselves. The complete list of ~175K 1963-1999 names is included in the file because these names can be matched to the names from 2000-2002, thus providing continuity. In fact, almost 26K have already been matched, accounting for about 80 per cent of the patents issued during this period. Documentation for this file is in assnames2_desc.doc.
Some Stata do files that have done the cleaning to date. Of particular interest is cleaname.do, which defines a number of the rules I used.
Finally, I have included a preliminary memo on the name-matching problem in these data. It is incomplete, especially with respect to the Compustat match, as I have not yet had time to pursue this problem.
Bronwyn H. Hall