Text Processing: Sentence & Word Tokenisation, Stemming and Lemmatisation

⚡️Hudson Ⓜ️endes
4 min read · Jan 17, 2023
"robot eating a lot of text in a fantasy world" , generated by https://stablediffusionweb.com/

“Data are represented in ways natural to problems from which they were derived”[1]. It is therefore not hard to see that we do not encode our love letters into a vectorial representation of our feelings and their meaning before writing them down.

Data Scientists must apply a process named Text Processing to prepare the “vast amount of text in the form of personal web pages, Twitter feeds, email, Facebook status updates, product descriptions, Reddit comments, blog postings”[1] that has become so prominent with the growing democratisation of the internet.

Text processing is a crucial step in many natural language processing (NLP) tasks, including text classification, information retrieval, and machine translation. The process involves several sub-tasks, including sentence segmentation, word tokenisation, stemming, and lemmatisation. This essay will discuss these sub-tasks and their importance in NLP.
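To make the four sub-tasks concrete, here is a minimal sketch using NLTK; the choice of library, the sample sentence, and the printed outputs are illustrative assumptions rather than part of the original article.

```python
# Sketch of sentence segmentation, word tokenisation, stemming and
# lemmatisation with NLTK (assumed library; any NLP toolkit would do).
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-off downloads of the models/corpora the calls below rely on.
nltk.download("punkt")
nltk.download("wordnet")

text = "The cats were running quickly. They had been chasing mice all day."

# 1. Sentence segmentation: split the text into individual sentences.
sentences = sent_tokenize(text)

# 2. Word tokenisation: split each sentence into tokens.
tokens = [word_tokenize(sentence) for sentence in sentences]

# 3. Stemming: strip affixes with a rule-based stemmer (may yield non-words).
stemmer = PorterStemmer()
stems = [[stemmer.stem(t) for t in sentence] for sentence in tokens]

# 4. Lemmatisation: map each token to its dictionary form (lemma).
lemmatizer = WordNetLemmatizer()
lemmas = [[lemmatizer.lemmatize(t) for t in sentence] for sentence in tokens]

print(sentences)  # ['The cats were running quickly.', 'They had been chasing mice all day.']
print(stems[0])   # ['the', 'cat', 'were', 'run', 'quickli', '.']
print(lemmas[0])  # ['The', 'cat', 'were', 'running', 'quickly', '.']
```

Note how the stemmer produces truncated, sometimes non-dictionary forms ("quickli"), while the lemmatiser returns valid words; this is the practical difference between the two techniques.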

Sentence segmentation, a special case of what was originally proposed as “Discourse Segmentation”[2], is the process of dividing a piece of text into individual sentences, and is often referred to as a Text Normalisation technique[3]. Though it might seem relatively straightforward for most languages, the approach of separating sentences by punctuation…
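To illustrate why punctuation alone is an unreliable boundary signal, here is a minimal sketch contrasting a naive split on punctuation with NLTK's pre-trained sentence tokenizer; the sample text and the library choice are assumptions made for illustration.

```python
# Naive punctuation-based sentence splitting vs. NLTK's Punkt tokenizer.
import re

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-off download of the Punkt sentence model

text = "Dr. Smith earned 3.5 million dollars. She donated half of it."

# Naive approach: treat every '.', '!' or '?' as a sentence boundary.
naive = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
print(naive)
# ['Dr', 'Smith earned 3', '5 million dollars', 'She donated half of it']

# Punkt-based approach: learned boundaries cope with abbreviations and decimals.
print(sent_tokenize(text))
# ['Dr. Smith earned 3.5 million dollars.', 'She donated half of it.']
```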
