Open-source journal n-gram dataset: The General Index

2021. 10. 30. 11:27 | Living as a Data Analyst


https://archive.org/details/GeneralIndex

The General Index consists of 3 tables derived from 107,233,728 journal articles.

A table of n-grams, ranging from unigrams to 5-grams, is extracted using spaCy. Each of the 355,279,820,087 rows of the n-gram table pairs an n-gram with a journal article id.
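
To make the row layout concrete, here is a minimal sketch of how (n-gram, article id) rows could be produced for n = 1 to 5. It is not the pipeline used to build The General Index: whitespace splitting stands in for spaCy tokenization, and the function name `ngram_rows` and the id `article-001` are made up for illustration.

```python
from typing import Iterator, List, Tuple


def ngram_rows(article_id: str, text: str, max_n: int = 5) -> Iterator[Tuple[str, str]]:
    """Yield (n-gram, article_id) pairs for every n from 1 to max_n."""
    # Assumption: plain whitespace splitting replaces spaCy's tokenizer here.
    tokens: List[str] = text.split()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start:start + n]), article_id


# Hypothetical article id and text, for illustration only.
rows = list(ngram_rows("article-001", "Graph neural networks learn node embeddings"))
for ngram, article_id in rows[:3]:
    print(ngram, article_id)
```

A text of k tokens yields k unigrams, k-1 bigrams, and so on down to k-4 5-grams, which is why the n-gram table has so many more rows than there are articles.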

A second table is constructed using YAKE and consists of 19,740,906,314 rows, each pairing a keyword with an article id.

A third table associates an article id with metadata.
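
Because all three tables share an article id, they can be joined on it. The sketch below shows that idea with an in-memory SQLite database. The table names, column names, and sample rows are assumptions for illustration, not the schema of the published files.

```python
import sqlite3

# Assumption: hypothetical table and column names, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata (article_id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE ngrams   (ngram TEXT, article_id TEXT);
CREATE TABLE keywords (keyword TEXT, article_id TEXT);
""")

# Made-up sample rows.
conn.execute("INSERT INTO metadata VALUES (?, ?)", ("a1", "Example article"))
conn.executemany(
    "INSERT INTO ngrams VALUES (?, ?)",
    [("gene expression", "a1"), ("expression", "a1")],
)
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?)",
    [("gene expression", "a1")],
)


def articles_with_ngram(db: sqlite3.Connection, ngram: str) -> list:
    """Return (article_id, title) for every article containing the n-gram."""
    cur = db.execute(
        "SELECT m.article_id, m.title "
        "FROM ngrams AS n JOIN metadata AS m ON m.article_id = n.article_id "
        "WHERE n.ngram = ? ORDER BY m.article_id",
        (ngram,),
    )
    return cur.fetchall()


def keywords_for(db: sqlite3.Connection, article_id: str) -> list:
    """Return the keywords recorded for one article id."""
    cur = db.execute(
        "SELECT keyword FROM keywords WHERE article_id = ? ORDER BY keyword",
        (article_id,),
    )
    return [row[0] for row in cur.fetchall()]


hits = articles_with_ngram(conn, "gene expression")
kws = keywords_for(conn, "a1")
print(hits, kws)
```

At the real scale of hundreds of billions of rows, a single-file SQLite database is unlikely to be practical. The sketch only illustrates how the article id ties the three tables together.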