NLP Concepts

Alysson Guimarães
4 min read · Feb 14, 2024


Image: Dr. Vegapunk | Punk-02, Lilith (reproduction: Geekdama)

Introduction

In a previous post, I talked a little about what NLP is and some ways to represent text numerically so that we can perform tasks such as information extraction, sentiment analysis, translation, text summarization, etc. In this post, we will cover some core concepts and methods of the discipline.

Textual Elements

All NLP methods start from a textual dataset, also called the corpus. The corpus contains the raw text and its metadata. Raw text is a sequence of characters (bytes), which are grouped into contiguous units called tokens.

Tokens are simply words and numeric sequences separated by spaces or punctuation.

In the context of machine learning, a text with its metadata is called an instance or data point. This collection of instances is the dataset.

To break the text into tokens, we perform tokenization. This can be done with libraries such as NLTK and spaCy.

Using the following sentence as an example:

from nltk import word_tokenize
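# nltk.download('punkt') may be required before the first use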

text = "Kaizoku ou ni ore wa naru"

word_tokenize(text)

The output will be:

['Kaizoku', 'ou', 'ni', 'ore', 'wa', 'naru']

We can also identify types, which are the unique tokens present in the text. The set of all types in a corpus forms the vocabulary or lexicon.
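
As a minimal sketch of these definitions, here is a toy sentence (invented for this illustration) with a repeated token, showing that the types are the unique tokens:

from nltk import word_tokenize

text = "ore wa ore da"
tokens = word_tokenize(text)
print(len(tokens), len(set(tokens)))  # 4 tokens, 3 types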

N-Grams

N-grams are fixed-length groupings of consecutive tokens. When we group two together we call them bigrams, three trigrams, and so on. N-grams are useful in text-analysis applications where word order matters, such as sentiment analysis, text classification, and text generation.
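
As a minimal sketch, NLTK's ngrams helper can build bigrams from the tokens of the earlier example sentence:

from nltk import ngrams, word_tokenize

text = "Kaizoku ou ni ore wa naru"
list(ngrams(word_tokenize(text), 2))
# [('Kaizoku', 'ou'), ('ou', 'ni'), ('ni', 'ore'), ('ore', 'wa'), ('wa', 'naru')]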

Text Normalization

Words can vary in many ways: adjectives inflect for gender and number; verbs inflect for mood (indicative, subjunctive, imperative), tense (present, past, future), number (singular and plural), and person (I, you, he/she, we…); nouns inflect as well. These variations considerably increase the number of tokens, and thus the dimensionality of the corpus. To normalize the text and reduce the size of the vector that will feed the model, we can use two techniques: lemmatization and stemming.

Lemmatization is the process of reducing a token to its dictionary form, removing inflection: verbs go to the infinitive and other words to their base form.

import spacy
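# the Portuguese model may need installing first: python -m spacy download pt_core_news_sm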
spacy_lemma = spacy.load('pt_core_news_sm')

texto = 'comendo, dormindo, caminhando, laranjas, maçãs'

doc = spacy_lemma(texto)
doc_lematizado = [token.lemma_ for token in doc]
doc_lematizado = ' '.join(doc_lematizado)  # join the lemmas back into a single string
doc_lematizado

The output will be:

'comer , dormir , caminhar , laranja , Maçãs'

Stemming is a similar process, but it only truncates the token to its stem (root or radical).

import nltk

# stemmer
stemmer = nltk.stem.SnowballStemmer("portuguese")

texto = 'comendo, dormindo, caminhando, laranjas, maçãs'

palavras = texto.split(', ')
palavras_stemming = [stemmer.stem(palavra) for palavra in palavras]
palavras_stemming

The output will be:

['com', 'dorm', 'caminh', 'laranj', 'maçãs']

(Note that the word "maçãs" ("apples") was neither lemmatized nor stemmed, so in a real case it would be necessary to replace this token another way, for example with a regex.)

Stopwords removal

Still aiming to reduce the dimensionality of the vector, we can remove stopwords. Stopwords are words that add little meaning or information, so removing them makes little difference. These are words like "as", "the", "ones", "of", "for", etc. Just like tokenization, lemmatization, and stemming, stopword removal can be done with libraries like NLTK or spaCy.
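
As a minimal sketch with NLTK's built-in Portuguese stopword list (reusing the sentence from the POS-tagging section below):

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# nltk.download('stopwords') may be required before the first use
stops = set(stopwords.words('portuguese'))

frase = "Eu vou me tornar o Rei dos Piratas"
[t for t in word_tokenize(frase) if t.lower() not in stops]
# roughly ['vou', 'tornar', 'Rei', 'Piratas'], depending on the NLTK version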

Categorizations

Sentence categorization or classification is one of the first and main applications of NLP. Problems such as assigning topic labels, predicting sentiment, filtering spam, and identifying language can all be framed as supervised document-classification problems.
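
As a minimal sketch of this framing, here is a toy spam filter using scikit-learn (a library not used elsewhere in this post; the four training documents and their labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = spam, 0 = not spam
docs = ["ganhe dinheiro rápido", "reunião amanhã às 10h",
        "promoção imperdível, clique aqui", "segue o relatório em anexo"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # bag-of-words counts
clf = MultinomialNB().fit(X, labels)   # supervised classifier

clf.predict(vec.transform(["clique e ganhe dinheiro"]))  # expected: array([1])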

Another form of categorization is word categorization. We can extend the idea of document categorization to tokens; an example of this is part-of-speech (POS) tagging. POS tagging is the process of categorizing words, based on their definition and context, as a specific part of speech, thus identifying verbs, adjectives, nouns, etc. POS tags describe the lexical structure of the sentence, which can be used to make assumptions about its semantics. Applications that build on POS tagging include Named Entity Recognition (NER) and speech recognition.

import spacy

nlp = spacy.load("pt_core_news_sm")

frase = "Eu vou me tornar o Rei dos Piratas"
documento = nlp(frase)

for token in documento:
    print(token.text, token.pos_)

Output:

Eu PRON
vou AUX
me PRON
tornar VERB
o DET
Rei NOUN
dos ADP
Piratas NOUN
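
As an aside on NER, the same spaCy pipeline also exposes named entities; a minimal sketch (the example sentence is invented, and which entities are found depends on the model):

import spacy

nlp = spacy.load("pt_core_news_sm")
documento = nlp("Monkey D. Luffy zarpou de East Blue")

for ent in documento.ents:
    print(ent.text, ent.label_)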

Conclusion

In this article we looked at NLP terminology, ideas, and concepts, without seeking to exhaust the topic. It is important to note that NLP is a very rich field of research, with many methods that do not rely on neural networks yet still have a major impact on business and are used in production. Contemporary approaches based on neural networks should be seen as a complement to, not a replacement for, traditional methods.

Follow my profile and sign up (here) to receive future posts on this and other topics.

(This is a version of the article translated from Brazilian Portuguese; the original is available here)

References

1. Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch. O'Reilly Media. ISBN: 978-1-491-97823-8.

2. Alves, I. (2021). Lemmatization vs. stemming: quando usar cada uma? Alura. Retrieved from https://www.alura.com.br/artigos/lemmatization-vs-stemming-quando-usar-cada-uma

3. Diana, D. (n.d.). Classificação dos Verbos. Toda Matéria. Retrieved from https://www.todamateria.com.br/classificacao-dos-verbos/


Alysson Guimarães

Data Scientist. This account is for translated versions of my Portuguese language articles. https://k3ybladewielder.medium.com/