yangcheng258/Patent_p2p_similarity_w2v


A text-embedding-based approach to measuring patent-to-patent technological similarity

This repository provides additional documentation and material for the following paper:

Abstract

This paper describes an efficiently scalable approach to measuring technological similarity between patents by combining embedding techniques from natural language processing with nearest-neighbor approximation. Using this methodology, we are able to compute similarities between all existing patents, which in turn enables us to represent the whole patent universe as a technological network. We validate both technological signature and similarity in various ways and, using the case of electric vehicle technologies, demonstrate their usefulness in measuring knowledge flows, mapping technological change, and creating patent quality indicators. This paper contributes to the growing literature on text-based indicators for patent analysis.

Highlights

  • We develop a method to create vector representations of patents based on text data.
  • We describe an efficient process that uses these vectors to create patent-to-patent similarity measures for large numbers of patents.
  • We provide all code and data for reproduction, use, and improvement.
  • We evaluate and illustrate the results empirically.
  • We illustrate the created measures and metrics using the case of electric vehicle patents.

The repository contains the following elements:

  • Code and a demo showing how to create TF-IDF weighted w2v embeddings of patent abstracts.
  • Code and a demo showing how to store the created embeddings in Annoy and retrieve p2p similarity measures.
  • Code and a demo showing how to create aggregated statistics based on p2p similarity measures.
  • Full data on the created p2p similarity measures.
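As a rough illustration of the first element, the sketch below computes a TF-IDF weighted average of word vectors for one tokenized abstract. It is not taken from the notebook: the function names, the smoothed IDF formula, and the toy three-dimensional vectors are assumptions made for this example. In practice the `word_vectors` mapping would be the vectors of a trained word2vec model (for instance the `wv` attribute of a Gensim `Word2Vec` model, which supports the same `in` and `[]` lookups).

```python
import math
from collections import Counter

import numpy as np


def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_weighted_embedding(tokens, word_vectors, idf, dim):
    """TF-IDF weighted average of the word vectors of one document."""
    # Keep only tokens that have both a word vector and an IDF weight.
    counts = Counter(t for t in tokens if t in word_vectors and t in idf)
    if not counts:
        return np.zeros(dim)
    total = sum(counts.values())
    weights = np.array([(c / total) * idf[t] for t, c in counts.items()])
    vectors = np.array([word_vectors[t] for t in counts])
    return weights @ vectors / weights.sum()


# Toy corpus and toy vectors (hypothetical stand-ins for real patent data).
docs = [
    ["battery", "electric", "vehicle"],
    ["battery", "charging", "station"],
    ["engine", "combustion", "vehicle"],
]
toy_vectors = {
    "battery": np.array([1.0, 0.0, 0.0]),
    "electric": np.array([0.0, 1.0, 0.0]),
    "vehicle": np.array([0.0, 0.0, 1.0]),
    "charging": np.array([1.0, 1.0, 0.0]),
}
idf = idf_weights(docs)
doc_embeddings = [tfidf_weighted_embedding(d, toy_vectors, idf, 3) for d in docs]
```

Tokens without a word vector (here "station", "engine", and "combustion") are skipped, and a document with no known tokens maps to the zero vector; whether the notebook handles these cases the same way is not stated here.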

Reproducing the model

Before running any code, install Python 3.7+ and the required libraries:

  • Numpy 1.21+
  • Gensim 4.1+

The repository includes a sample of 1k patent abstracts in patent_dtatset_sample_1k.csv.

The notebook (Patent_W2V_v2_version2.ipynb) covers every step: preprocessing, training word2vec, computing embeddings, and building and tuning the trees of the Annoy (Approximate Nearest Neighbors Oh Yeah) index.

Queries, comments, and feedback are always welcome :)

Also, check out the newest version of SBERT-based p2p similarity measures and automated patent classification.
