We’ve given the CRAN task view on Natural Language Processing an overhaul and added the following packages to the list:

  • gutenbergr allows downloading and processing public domain works in the Project Gutenberg collection. It includes metadata for all Project Gutenberg works, so that they can be searched and retrieved.
  • hunspell is a stemmer and spell-checker library designed for languages with rich morphology and complex word compounding or character encoding. The package can check and analyze individual words as well as search for incorrect words within a text, LaTeX, or (R package) manual document.
  • monkeylearn provides a wrapper interface to the machine learning services on MonkeyLearn for text analysis, i.e., classification and extraction.
  • mscstexta4r provides an interface to the Microsoft Cognitive Services Text Analytics API and can be used to perform sentiment analysis, topic detection, language detection, and key phrase extraction.
  • mscsweblm4r provides an interface to the Microsoft Cognitive Services Web Language Model API and can be used to calculate the probability that a sequence of words appears together, calculate the conditional probability that a specific word follows an existing sequence of words, list the words (completions) most likely to follow a given sequence of words, and insert spaces into a string of words joined together without spaces (hashtags, URLs, etc.).
  • PGRdup supports fuzzy, phonetic and semantic matching of words. In particular, the DoubleMetaphone function converts strings to double metaphone phonetic codes.
  • phonics provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others.
  • quanteda supports quantitative analysis of textual data.
  • tesseract is an OCR engine with Unicode (UTF-8) support that can recognize over 100 languages out of the box.
  • text2vec provides tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities.
  • tidytext supports text mining for word processing and sentiment analysis using dplyr, ggplot2, and other tidy tools.
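
Several of the packages above deal with phonetic matching. To give a rough idea of what a phonetic algorithm such as Soundex computes, here is a minimal sketch of the classic American Soundex rules. It is written in Python purely for illustration, it is not the code used by phonics or PGRdup, and the function name `soundex` and the `CODES` table are our own choices rather than part of any package's API.

```python
# Illustrative sketch of American Soundex; not taken from any R package.
# Consonants are grouped into six classes; vowels (and y) carry no code.
CODES = {}
for digit, letters in {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ("" if it has no ASCII letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    result = first.upper()
    # The first letter is kept as-is, but its class still suppresses
    # an immediately following letter of the same class.
    prev = CODES.get(first, "")

    for ch in letters[1:]:
        if ch in "hw":
            # h and w are skipped and do NOT separate equal codes.
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        # Vowels reset `prev` to "", so a repeated class after a vowel is kept.
        prev = code
        if len(result) == 4:
            break

    # Pad short codes with zeros to the fixed length of four.
    return result.ljust(4, "0")


if __name__ == "__main__":
    for name in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
        print(name, soundex(name))
```

Names that sound alike, such as "Robert" and "Rupert", map to the same code, which is what makes this kind of encoding useful for fuzzy matching of words.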