SMS text similarity calculation based on topic model

Chengfang Tan1, 2, Caiyin Wang2, Lin Cui1, 2 

1School of Information Engineering, Suzhou University, Suzhou 234000, Anhui, China

2Intelligent Information Processing Lab, Suzhou University, Suzhou 234000, Anhui, China

Traditional text similarity calculation relies mainly on statistical and semantic methods, which suffer from data sparsity and high dimensionality, among other problems. To improve SMS text similarity calculation, this paper proposes a similarity calculation method based on a topic model. The SMS document set is modeled with LDA (Latent Dirichlet Allocation), and the model parameters are inferred via the Gibbs sampling algorithm, which yields the topic-word and document-topic probability distributions of the SMS document set. The JS (Jensen-Shannon) distance formula is then used to calculate SMS text similarity, and text clustering experiments are performed on the resulting similarity matrix with a single-pass incremental clustering algorithm. Experimental results show that the proposed method achieves a better F-measure than traditional text similarity calculation methods, which demonstrates its effectiveness and superiority.
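As an illustration of the last two steps of this pipeline, the following minimal Python sketch computes a Jensen-Shannon based similarity between document-topic distributions and groups documents with a single-pass procedure. It is a sketch under stated assumptions, not the authors' implementation: the document-topic vectors are taken as already inferred by LDA, the mapping `similarity = 1 - JS divergence` (base-2 logarithm, so the divergence lies in [0, 1]), the use of a cluster's mean topic distribution as its representative, and the names `js_divergence`, `js_similarity`, `single_pass_cluster`, and `threshold` are all illustrative choices that the abstract does not specify.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # The mixture m is positive wherever p or q is, so KL stays finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(p, q):
    """Assumed similarity mapping: 1 minus the base-2 JS divergence."""
    return 1.0 - js_divergence(p, q)


def centroid(vectors):
    """Mean topic distribution of a cluster (assumed representative)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def single_pass_cluster(doc_topics, threshold):
    """Assign each document, in arrival order, to the most similar
    existing cluster, or open a new cluster if none reaches threshold."""
    clusters = []  # each cluster is a list of document indices
    for i, theta in enumerate(doc_topics):
        best_cluster, best_sim = None, -1.0
        for members in clusters:
            rep = centroid([doc_topics[j] for j in members])
            sim = js_similarity(theta, rep)
            if sim > best_sim:
                best_cluster, best_sim = members, sim
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(i)
        else:
            clusters.append([i])
    return clusters


# Hypothetical document-topic distributions for four messages, two topics.
doc_topics = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]]
clusters = single_pass_cluster(doc_topics, threshold=0.9)
print(clusters)
```

Because single-pass clustering processes documents in order and never revisits an assignment, the result depends on both the arrival order and the chosen threshold; the threshold value used here is arbitrary.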