Full-text Retrieval Model based on Term Frequency and Position Weighting
DOI: 10.25236/icsemc.2017.06
Author(s)
Zhang Rui, Xie Puzhao, Sun Rui, Yang Luchang, Jiang Feiyue
Corresponding Author
Zhang Rui
Abstract
Mainstream literature retrieval systems today are based on search terms: they extract the title, keywords, and summary of each document to perform retrieval. In this article, a full-text retrieval model for computer science literature, based on Lucene, is proposed. A word frequency weighting algorithm is adopted to set the weighting coefficients of the document fields. Attributes of the computer science literature are introduced into the evaluation model as an important indicator of a document's value. The multi-factor influence model employs a simulated annealing algorithm to fit the best weight coefficient for each factor, making up for the defect that Lucene's default retrieval method can only retrieve by word frequency. The experimental data, whose elements are drawn from CNKI, were divided into a training set and a test set. The weights of each field are trained by carrying out feature extraction. The model is then validated on the test set, which consists of a fixed number of high-quality documents and inferior ones. The experimental results show that the trained model has higher precision in selecting high-quality documents.
Keywords
Lucene, Full-text search, Word frequency position weighting, Computer science, Multi-factor influence model, Simulated annealing.
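
The abstract describes two steps: scoring a document by term frequencies weighted per field, and fitting those field weights with simulated annealing. The sketch below illustrates that idea in plain Python under stated assumptions; it is not the authors' implementation and does not use the Lucene API. The field names (`title`, `body`), the objective (precision among the top k ranked documents), the perturbation step, and the geometric cooling schedule are all hypothetical choices made for illustration, and `TOY_DOCS` is invented data.

```python
import math
import random


def field_score(term_counts, weights):
    """Weighted sum of per-field term frequencies for one document."""
    return sum(w * term_counts.get(field, 0) for field, w in weights.items())


def precision_at_k(docs, weights, k):
    """Fraction of the k highest-scoring documents labeled high quality."""
    ranked = sorted(docs, key=lambda d: field_score(d["tf"], weights), reverse=True)
    return sum(d["good"] for d in ranked[:k]) / k


def anneal(docs, fields, k, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for field weights that maximize precision_at_k (assumed objective)."""
    rng = random.Random(seed)
    current = {f: 1.0 for f in fields}          # start from equal weights
    cur_score = precision_at_k(docs, current, k)
    best, best_score = dict(current), cur_score
    t = t0
    for _ in range(steps):
        # Perturb one randomly chosen weight, keeping it non-negative.
        cand = dict(current)
        f = rng.choice(fields)
        cand[f] = max(0.0, cand[f] + rng.uniform(-0.5, 0.5))
        cand_score = precision_at_k(docs, cand, k)
        delta = cand_score - cur_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = dict(current), cur_score
        t *= cooling                             # geometric cooling (assumption)
    return best, best_score


# Hypothetical toy data: per-field query-term counts and a quality label.
TOY_DOCS = [
    {"tf": {"title": 3, "body": 1}, "good": 1},
    {"tf": {"title": 2, "body": 0}, "good": 1},
    {"tf": {"title": 0, "body": 9}, "good": 0},
    {"tf": {"title": 0, "body": 8}, "good": 0},
]

if __name__ == "__main__":
    weights, score = anneal(TOY_DOCS, ["title", "body"], k=2)
    print(weights, score)
```

With equal weights the toy ranking favors the two body-heavy documents, so the search has to shift weight toward `title` to improve precision. A real system would compute the per-field frequencies from an index and would likely add the other document attributes the abstract mentions as further weighted factors.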