The Analysis of Connected Components and Clustering in Segmentation of Persian Texts

Author Affiliations

  • 1 Faculty Member of Technical and Vocational University, Kerman, IRAN
  • 2 Faculty Member of Technical and Vocational University, Yazd, IRAN
  • 3 Faculty Member of Azad University, South Tehran Branch, Tehran, IRAN
  • 4 Faculty Member of Shahid Bahonar University, Kerman, IRAN

Res. J. Recent Sci., Volume 3, Issue (4), Pages 71-77, April,2 (2014)


As computers become more widely used in daily life and structured electronic documents grow in use, owing to the advantages they offer, the need to convert paper documents into electronic form, and with it the use of image processing, has increased. One line of research in this field is the recognition of words in text, which has been studied extensively for languages such as English, Japanese, and Chinese. Persian and Arabic, however, still require further research because of their complexity, in particular the interconnection of letters and the different forms a letter takes depending on its position in a word. Segmentation is one of the most important steps in a character recognition system, and both its accuracy and its speed matter. Segmenting Persian text is especially hard because of these characteristics of the language. In this study, we present an algorithm for segmenting Persian documents that is faster and more efficient than comparable algorithms; it uses connected components and clustering to identify and group the text and image areas of a document. The method is intended for ordinary users and can serve as a preprocessing step for Optical Character Recognition (OCR) systems. The study was carried out on a collection of 100 images of Persian newspapers and magazines scanned at a resolution of 300 dpi. The simulation results show an accuracy of 92.3% and a significant speed advantage over other approaches such as the Voronoi diagram.
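
The abstract names connected components and clustering but gives no implementation details. The following is a minimal Python sketch of the general idea only, not the authors' algorithm. It assumes a binarized page stored as a list of lists (1 = ink, 0 = background), labels 8-connected components with a breadth-first search, and then splits the components into two groups with a simple two-cluster 1-D k-means on bounding-box area. The function names, the toy `page`, and the choice of bounding-box area as the clustering feature are all illustrative assumptions.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one component, tracking its extent.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def box_area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


def cluster_text_vs_image(boxes, iterations=10):
    """Assumed heuristic: two-cluster 1-D k-means on area; small = text, large = image."""
    if not boxes:
        return []
    areas = [box_area(b) for b in boxes]
    small, large = float(min(areas)), float(max(areas))  # initial cluster centres
    labels = ["text"] * len(boxes)
    for _ in range(iterations):
        labels = ["text" if abs(a - small) <= abs(a - large) else "image"
                  for a in areas]
        text_areas = [a for a, lab in zip(areas, labels) if lab == "text"]
        image_areas = [a for a, lab in zip(areas, labels) if lab == "image"]
        if text_areas:
            small = sum(text_areas) / len(text_areas)
        if image_areas:
            large = sum(image_areas) / len(image_areas)
    return labels


# Toy page (hypothetical data): two thin strokes and one solid block.
page = [
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
boxes = connected_components(page)
labels = cluster_text_vs_image(boxes)
for box, label in zip(boxes, labels):
    print(box, label)
```

On real scans, area alone would likely be too crude a feature; a practical system would probably combine several component properties (height, width, ink density, spacing to neighbours) and would need to cope with the dots and diacritics that make Persian components especially fragmented.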


