Imagine you have thousands of books and you are supposed put them in different sections of the library! you are not going to read every book to see which genre the book belongs to. Someone has already assigned the book to a genre: romance, drama, history,… These are labels and we can build a classifier to do this classification for us. But how?

Let’s see the definition of text classification first: Text classification is basically the process of automatically assigning categories or labels to text documents, based on their content. Some examples of the applications of text classification are sentiment analysis and spam detection.

Okay, now how are we going to automatically put documents in predefined categories? We can use a Python library called TextBlob to do that.

TextBlob is a Python library for processing textual data that provides an easy-to-use interface for performing common NLP tasks, including text classification. To use TextBlob for text classification, you first need to define a set of categories or labels that you want to classify your text into. For example, if you want to classify news articles into categories such as sports, politics, and entertainment, you would define these categories as labels.

Once you have defined your labels, you can train a TextBlob classifier on a labeled dataset of text documents. The TextBlob classifier uses a Naive Bayes algorithm to learn patterns in the training data and predict the most likely label for new, unlabeled documents.

In this example, I have two csv files. The first CSV file, called train, has around 8000 rows and two columns: a comment column and a topic column. The topics are “biology”, “physics”, and “chemistry”.

I am going to train on the above dataset and then evaluate the accuracy of the test dataset which is my second csv file with around 2000 rows and the same columns and labels.

Here is the code:

!pip install -U textblob
!python -m textblob.download_corpora
from textblob.classifiers import NaiveBayesClassifier

# reading the train csv and train the classifier
with open('train.csv', 'r', encoding='latin-1') as f:
  cl = NaiveBayesClassifier(f, format='csv')

# test the accuracy of the classifier on the test csv
with open('test.csv', 'r', encoding='latin-1') as f:
  print(cl.accuracy(f, format='csv'))

# prints 0.97

# classify a text using the classifier
text = "I'm skeptical. A heavier lid would be needed to build pressure, while a lighter lid is needed to move a lot with the release of pressure. I feel like I am missing something here."

print(cl.classify(text)
)
# prints physics

# check the informative features
print(cl.show_informative_features(5)
)



Similar Posts