Imagine you have thousands of books and you are supposed to put them in different sections of a library. You are not going to read every book to see which genre it belongs to. Someone has already assigned each book to a genre: romance, drama, history, and so on. These genres are labels, and we can build a classifier that learns from them and does this classification for us. But how?

Let’s start with a definition: text classification is the process of automatically assigning categories or labels to text documents based on their content. Sentiment analysis and spam detection are two common applications.

Okay, now how are we going to automatically put documents in predefined categories? We can use a Python library called TextBlob to do that.

TextBlob is a Python library for processing textual data that provides an easy-to-use interface for performing common NLP tasks, including text classification. To use TextBlob for text classification, you first need to define a set of categories or labels that you want to classify your text into. For example, if you want to classify news articles into categories such as sports, politics, and entertainment, you would define these categories as labels.

Once you have defined your labels, you can train a TextBlob classifier on a labeled dataset of text documents. The TextBlob classifier uses a Naive Bayes algorithm to learn patterns in the training data and predict the most likely label for new, unlabeled documents.
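To make "learning patterns" a bit more concrete, here is a minimal from-scratch sketch of the Naive Bayes idea using only the standard library. It is not TextBlob's implementation: the whitespace tokenizer, the add-one smoothing, the function names, and the toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; a deliberately crude tokenizer.
    return text.lower().split()

def train(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the label
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the dataset used in this post.
toy_train = [
    ("cells divide by mitosis", "biology"),
    ("enzymes are proteins in cells", "biology"),
    ("force equals mass times acceleration", "physics"),
    ("gravity is a force between masses", "physics"),
    ("acids react with bases", "chemistry"),
    ("atoms form bonds in a molecule", "chemistry"),
]
model = train(toy_train)
prediction = classify(model, "the force of gravity")
print(prediction)
```

The classifier multiplies (here, adds in log space) the evidence from each word independently, which is the "naive" assumption that gives the algorithm its name.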

In this example, I have two CSV files. The first one, called train, has around 8000 rows and two columns: a comment column and a topic column. The topics are “biology”, “physics”, and “chemistry”.

I am going to train a model on this dataset and then evaluate its accuracy on the test dataset, which is my second CSV file, with around 2000 rows and the same columns and labels.
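Before training, it can help to check how the labels are distributed and whether any rows are malformed. The sketch below uses an in-memory stand-in for the file so that it runs on its own; the sample rows are invented, and it assumes the file has no header row and stores the text first and the label second, which is the layout TextBlob's CSV examples use.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the first rows of train.csv.
# With a real file you would use: open('train.csv', 'r', encoding='latin-1')
sample_csv = io.StringIO(
    '"Mitochondria produce most of the ATP in a cell.",biology\n'
    '"Momentum is conserved in a closed system.",physics\n'
    '"Does a catalyst change the equilibrium, or only the rate?",chemistry\n'
    '"Why do enzymes denature at high temperature?",biology\n'
)

rows = list(csv.reader(sample_csv))

# Rows that do not have exactly two fields would confuse training.
malformed = [row for row in rows if len(row) != 2]

# Count how many examples each topic has.
label_counts = Counter(row[1] for row in rows if len(row) == 2)

print("Rows per label:", dict(label_counts))
print("Malformed rows:", len(malformed))
```

A heavily skewed label distribution is worth knowing about early, because it changes how an accuracy figure should be read.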

Implementing Text Classification with TextBlob

Let’s walk through the steps of how to perform text classification using TextBlob. We will:

  • Train a Naive Bayes classifier on a labeled dataset.
  • Evaluate its accuracy on a test dataset.
  • Use the trained model to classify new text.

Installing and Importing Required Libraries

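Before running the code, install TextBlob and download the necessary corpora. The leading ! runs a shell command from a notebook cell; drop it if you are typing in a regular terminal.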

!pip install -U textblob
!python -m textblob.download_corpora

Training the Naive Bayes Classifier

from textblob.classifiers import NaiveBayesClassifier

# Read the training CSV and train the classifier on it
with open('train.csv', 'r', encoding='latin-1') as f:
    cl = NaiveBayesClassifier(f, format='csv')

Evaluating Model Accuracy

# Testing the accuracy on the test dataset
with open('test.csv', 'r', encoding='latin-1') as f:
    print("Accuracy:", cl.accuracy(f, format='csv'))

Output

Accuracy: 0.97

This means the classifier assigned the correct topic to 97% of the test documents, which is a strong result for such a simple model.
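Accuracy is simply the share of test documents whose predicted label matches the true one, and a single number can hide differences between topics. The sketch below computes overall and per-label accuracy from two parallel lists. The function name and the example labels are hypothetical; with the real model, the predictions could come from calling cl.classify on each test comment.

```python
from collections import Counter

def accuracy_report(gold_labels, predicted_labels):
    """Return (overall accuracy, per-label accuracy) for two parallel lists."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("need at least one example")
    totals = Counter(gold_labels)
    # Count correct predictions, grouped by the true label.
    correct = Counter(
        gold for gold, pred in zip(gold_labels, predicted_labels) if gold == pred
    )
    overall = sum(correct.values()) / len(gold_labels)
    per_label = {label: correct[label] / totals[label] for label in totals}
    return overall, per_label

# Hypothetical labels for six test comments.
gold = ["physics", "physics", "biology", "chemistry", "biology", "chemistry"]
pred = ["physics", "chemistry", "biology", "chemistry", "biology", "physics"]

overall, per_label = accuracy_report(gold, pred)
print("Overall:", round(overall, 2))
print("Per label:", per_label)
```

If one topic scores much lower than the others, that is usually a better place to start improving the model than the overall figure.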

Classifying New Text

Let’s classify a new text snippet:

text = "I'm skeptical. A heavier lid would be needed to build pressure, while a lighter lid is needed to move a lot with the release of pressure. I feel like I am missing something here."
print("Predicted category:", cl.classify(text))

Output

Predicted category: physics

Checking Informative Features

To understand which words contribute the most to classification, we can check the informative features:

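# show_informative_features prints its own output and returns None,
# so it does not need to be wrapped in print()
cl.show_informative_features(5)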

This will display the 5 most informative features, that is, the words whose presence or absence most strongly separates one topic from another.
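To give a feel for what "informative" means here, the following sketch ranks words by how much more likely they are to appear in documents of one label than another. It illustrates the likelihood-ratio idea only and is not the code the library runs; the function name, the smoothing value, and the toy sentences are assumptions.

```python
from collections import Counter, defaultdict

def informative_words(examples, top_n=2, smoothing=0.5):
    """Rank words by the ratio of their highest to lowest per-label probability."""
    docs_per_label = Counter()
    docs_with_word = defaultdict(Counter)  # word -> label -> document count
    for text, label in examples:
        docs_per_label[label] += 1
        for word in set(text.lower().split()):
            docs_with_word[word][label] += 1

    ranked = []
    for word, counts in docs_with_word.items():
        # Smoothed probability that a document of each label contains the word.
        probs = {
            label: (counts[label] + smoothing) / (n_docs + 2 * smoothing)
            for label, n_docs in docs_per_label.items()
        }
        best = max(probs, key=probs.get)
        worst = min(probs, key=probs.get)
        ranked.append((probs[best] / probs[worst], word, best, worst))

    # Highest ratio first; ties broken alphabetically by word.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:top_n]

# Hypothetical toy data with two labels.
toy = [
    ("force and energy", "physics"),
    ("force and momentum", "physics"),
    ("energy of a force", "physics"),
    ("cell and protein", "biology"),
    ("cell and enzyme", "biology"),
    ("energy in a cell", "biology"),
]

top = informative_words(toy)
for ratio, word, best, worst in top:
    print(f"{word}: {best} vs {worst} = {ratio:.1f} : 1.0")
```

Words that occur about equally often in every topic, such as "and" in this toy data, end up with a ratio near 1 and tell the classifier very little.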

Conclusion

Text classification is a crucial NLP task with numerous practical applications. Using TextBlob’s Naive Bayes Classifier, we successfully built a simple and effective model to classify scientific topics.

Key Takeaways:

  • Text classification automates the categorization of documents into predefined labels.
  • TextBlob provides a Naive Bayes-based classifier that is easy to use and implement.
  • Our trained classifier achieved 97% accuracy on a test dataset.
  • The same approach can be applied to real-world tasks like news categorization, spam filtering, and sentiment analysis, given labeled data for each task.

If you want to explore more, try using different datasets or experimenting with feature engineering to improve classification accuracy.
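As one possible starting point for feature engineering, here is a sketch of a custom feature extractor that drops a few common words and adds adjacent word pairs as features. The stopword list, the function name, and the feature naming are assumptions made for illustration. TextBlob classifiers accept a feature_extractor argument, so a function like this could be passed in when building the classifier, as shown in the trailing comment.

```python
import re

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "i"}

def unigram_bigram_extractor(document):
    """Map a document to boolean features for content words and adjacent pairs."""
    tokens = [
        token
        for token in re.findall(r"[a-z']+", document.lower())
        if token not in STOPWORDS
    ]
    features = {f"contains({token})": True for token in tokens}
    # Adjacent word pairs capture a little context that single words miss.
    for first, second in zip(tokens, tokens[1:]):
        features[f"pair({first} {second})"] = True
    return features

features = unigram_bigram_extractor("The release of pressure moves the lid.")
for name in features:
    print(name)

# To try it with TextBlob (assumes textblob is installed and train.csv exists):
#   with open('train.csv', 'r', encoding='latin-1') as f:
#       cl = NaiveBayesClassifier(f, format='csv',
#                                 feature_extractor=unigram_bigram_extractor)
```

Whether this helps depends on the data, so compare the accuracy on the test set before and after changing the extractor.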
