This repository was archived by the owner on Nov 14, 2022. It is now read-only.

Setup Instructions

gregtheyoung edited this page Dec 12, 2014 · 9 revisions

InventorDisambiguator Instructions

Setting up an Environment

The current code, found here on GitHub, has been tested to run on a Windows 2012 R2 Standard machine with a minimum of XXG RAM and XXG disk.

These instructions presume:

  • Windows 2012 R2 Standard has already been installed, as is the case, e.g., with a new AWS (Amazon Web Services) EC2 (Elastic Compute Cloud) instance.
  • Nothing else has been installed except for the standard OS install.
  • The machine has a C drive on which the OS has been installed and other software will be installed, and a D drive which will be used for all InventorDisambiguator code and data processing. Adjust the instructions accordingly if a different drive letter is used.
  • Downloaded files are put in d:\Downloads. Adjust the instructions accordingly if a different drive letter is used.
  • The user of this document is familiar with installing software and use of the software once installed.
  • For background, this setup has worked on an AWS EC2 instance launched from the image “Microsoft Windows Server 2012 R2 Base - ami-beca16d6”.

  1. Download and install MSVC++ 2010 redist. This is needed for Octave.
    1. http://www.microsoft.com/en-us/download/details.aspx?id=8328
  2. Download and install MSVC++ 2012 redist, x86. This is needed for PHP.
    1. http://www.microsoft.com/en-us/download/details.aspx?id=30679
    2. When asked, the file you want is VSU_4\vcredist_x86.exe
  3. [Note: deprecated - replaced with Julia below] Download and install Octave 3.6.4 precompiled for Windows Visual Studio from here:
    1. http://sourceforge.net/projects/octave/files/Octave%20Windows%20binaries/
    2. Run octave-3.6.4-vs2010-setup.exe using all default settings
  4. Download and install PHP 5.5 (VC11 x86 Thread Safe) from this page: http://windows.php.net/download/
    1. http://windows.php.net/downloads/releases/php-5.5.17-Win32-VC11-x86.zip
    2. Unzip to c:\PHP
    3. Install per instructions here. These two sets of instructions specifically:
      1. http://us3.php.net/manual/en/install.windows.manual.php
      2. http://us2.php.net/manual/en/install.windows.commandline.php
    4. The specific steps that were done during the last install were:
      1. copy the php.ini-production into php.ini
      2. In php.ini
        1. Removed comment from line 721: extension_dir = "ext"
    5. Add c:\php to the system environment PATH variable
    6. Add .PHP to the system environment PATHEXT variable
  5. [Note - deprecated - see C++ setup below] Download and install Julia
    1. Download the Windows 64-bit from here: http://julialang.org/downloads/
    2. Run the downloaded file and specify C:\Julia
  6. Download and install MinGW, g++, the Eigen library, and zlib1.dll
    1. Download and install mingw 64-bit: http://sourceforge.net/projects/mingw-w64/
      1. Select x86_64 when the option dialog is presented.
      2. After install, add C:\Program Files\mingw-w64\x86_64-4.9.2-posix-seh-rt_v3-rev0\mingw64\bin to user environment PATH variable (presuming you installed version 4.9.2 - adjust as necessary).
    2. Download and install the Eigen library
      1. http://bitbucket.org/eigen/eigen/get/3.2.2.zip
      2. Unzip to c:\Eigen_3.2.2
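
Once the installs above are done, a quick check that the tools are reachable can save time later. The following is only a sketch and is not part of the InventorDisambiguator code: the function name missing_prerequisites is made up for illustration, and the tool names and the Eigen path simply mirror the instructions above (adjust them if you installed elsewhere).

```python
import os
import shutil


def missing_prerequisites(tools=("php", "g++"),
                          eigen_dir=r"c:\Eigen_3.2.2",
                          which=shutil.which,
                          isdir=os.path.isdir):
    """Return a list of problems found; an empty list means every check passed.

    `which` and `isdir` are parameters only so the check can be exercised
    on a machine that does not have the real tools installed.
    """
    problems = []
    for tool in tools:
        # shutil.which searches PATH (and honours PATHEXT on Windows).
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # Assumption: Eigen was unzipped to the directory named in the steps above.
    if not isdir(eigen_dir):
        problems.append(f"Eigen directory missing: {eigen_dir}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

If the script prints nothing, php and g++ were both found on the PATH and the Eigen directory exists.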

Preparing for a First Time Run of the InventorDisambiguator

  1. Download and install disambiguator code from https://github.com/CSSIP-AIR/InventorDisambiguator
  2. Put it into d:\InventorDisambiguator
    1. Note that by default if you use the “Download Zip” from the GitHub site, the zip file will have an extra directory of “InventorDisambiguator-master” in it to reflect the “master” branch, so you may need to use settings in your unzip tool (7zip) or move files so that the actual files begin directly under d:\InventorDisambiguator.
  3. Compile the main.cpp file
    1. g++ --std=c++11 -o disambig -Wall -DNDEBUG -Ic:\Eigen_3.2.2 main.cpp
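
If the compile step needs to be repeated (for example after pulling new code), it can be scripted. The sketch below only assembles the same g++ command shown above; the function name compile_command and its parameters are illustrative and not part of the project.

```python
import subprocess


def compile_command(source="main.cpp", output="disambig",
                    eigen_dir=r"c:\Eigen_3.2.2"):
    """Build the g++ argument list used in the compile step above."""
    return [
        "g++",
        "--std=c++11",      # C++11 language standard
        "-o", output,       # name of the produced executable
        "-Wall",            # enable common warnings
        "-DNDEBUG",         # define NDEBUG, which disables assert() checks
        f"-I{eigen_dir}",   # directory where the Eigen headers were unzipped
        source,
    ]


if __name__ == "__main__":
    # Run from the directory that contains main.cpp;
    # check=True raises CalledProcessError if g++ reports a failure.
    subprocess.run(compile_command(), check=True)
```

Pass a different eigen_dir if Eigen was unzipped somewhere other than c:\Eigen_3.2.2.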

Run the InventorDisambiguator

  1. Go to d:\InventorDisambiguator
  2. Get a TSV (tab-separated values) file from the PatentsProcessor and put it in the directory. It is produced by run_consolidate.bat as called by start.py.
    1. It has the naming convention disambiguator_[MM]_[dd].tsv, where MM is the name of the month it was produced and dd is the day of the month. For example: disambiguator_August_18.tsv
  3. php Initialize_Input.php disambiguator_August_18.tsv
  4. php Initialize_ID.php
  5. php Matrixify_Attributes.php
  6. [Note - deprecated] c:\software\octave-3.6.4\bin\octave.exe
    1. source("load.m")
    2. source("disambig.m")
  7. disambig d:\InventorDisambiguator
  8. You should now have a _disambiguator_output.tsv file.
  9. That file will be used by the PatentsProcessor. See that project for instructions on how to use this file to integrate back into the patents database.
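
The run steps above can be sketched as a small helper that checks the input file name and lists the commands in order. This is an illustration only: the names INPUT_NAME, parse_input_name and pipeline_commands are invented here, and the pattern assumes the month is spelled out with a one- or two-digit day, as in the August_18 example. The deprecated Octave step is left out.

```python
import re

# Matches names such as disambiguator_August_18.tsv.
# Assumption: the month is spelled out and the day has one or two digits.
INPUT_NAME = re.compile(r"^disambiguator_([A-Za-z]+)_(\d{1,2})\.tsv$")


def parse_input_name(filename):
    """Return (month, day) from an input file name, or None if it does not match."""
    match = INPUT_NAME.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def pipeline_commands(tsv_name, workdir=r"d:\InventorDisambiguator"):
    """List the non-deprecated commands from the steps above, in run order."""
    return [
        ["php", "Initialize_Input.php", tsv_name],
        ["php", "Initialize_ID.php"],
        ["php", "Matrixify_Attributes.php"],
        ["disambig", workdir],
    ]
```

From inside d:\InventorDisambiguator, each list could be handed to subprocess.run(cmd, check=True) so that a failing step stops the run before the next one starts.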

Special Notes

  • Note: when running on an EC2 c3.large (3.75G RAM), I would get a zend_mm_heap corrupted message when running Initialize_ID.php. Changing to an r3.2xlarge (61G RAM) fixed the problem. I didn't try intermediate sizes.
  • Matrixify_Attributes.php was designed to use multiple threads via the PCNTL extension. That is not supported for PHP on Windows. I removed that for now, so it runs with only one thread and thus uses only one core at a time.
  • disambig.m also uses only one core. Octave is single-threaded, as is Matlab (I believe). This part can run for a week, so to make it perform better the algorithm would have to be changed to run in parallel. Another thought is to do blocking and spawn an instance for each block. For example, since the current algorithm never collapses two records unless at least the first initial and last name are identical, we could split the rows into, say, 26 blocks, with each block containing rows that share the same last initial. There'd be much more code change than that (e.g. creating unique IDs for disambiguated inventors), but it could possibly run in maybe 1/10th the time.
  • Using Matlab rather than Octave in a test on all 2005 patents: Matlab took 7 minutes, Octave was still going when killed after 48 hours.
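
The blocking idea in the notes above could look roughly like the following. This is a sketch of the suggestion only, not something the current code does: the function name block_by_last_initial and the (first_name, last_name) record layout are assumptions, since the actual TSV columns are not described here.

```python
from collections import defaultdict


def block_by_last_initial(records):
    """Group (first_name, last_name) records by the first letter of the last name.

    Records whose last name is empty or does not start with a letter go
    into a catch-all block keyed by "#".
    """
    blocks = defaultdict(list)
    for first_name, last_name in records:
        initial = last_name.strip()[:1].upper()
        key = initial if initial.isalpha() else "#"
        blocks[key].append((first_name, last_name))
    return dict(blocks)
```

Each block could then be written to its own input file and handed to a separate disambiguator instance. Names that start with a non-ASCII letter get their own block, so the count can exceed 26, and the unique inventor IDs would still need to be reconciled across blocks afterwards.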