Text Classification with Python using Scikit-Learn

1. What is Text Classification?

Text classification is a common task in natural language processing: assigning labels or categories to documents based on their content. It has a variety of applications, such as detecting user sentiment in a tweet, classifying an email as spam or ham, and sorting blog posts into categories.

In this article, we will build and compare three text classifiers that label text messages as spam or ham (not spam) using Python and Scikit-Learn. The classifiers are Multinomial Naive Bayes (MultinomialNB), Complement Naive Bayes (ComplementNB), which is suited to imbalanced datasets, and Linear Support Vector Classification (LinearSVC).

Note: I have a video tutorial for this post that you can find at the end of this article or on my YouTube channel, Pythonology.

2. Importing Libraries

Firstly, we need to import the necessary libraries:

# !pip install pandas
import pandas as pd
# !pip install scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

3. Loading the Dataset

First, let’s download the dataset, which contains two columns: ‘Message’ and ‘Category’. ‘Category’ holds the actual label of the message (spam or ham), and ‘Message’ holds the text of the message.

df = pd.read_csv('dataset.csv')
X = df['Message']
y = df['Category']
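
Since ComplementNB is aimed at imbalanced data, it can help to look at the label distribution before modelling. The sketch below uses a small made-up DataFrame as a stand-in for dataset.csv (the rows are hypothetical; only the column names come from this article):

```python
import pandas as pd

# Hypothetical stand-in for dataset.csv with the same two columns.
df = pd.DataFrame({
    "Message": ["hi there", "win cash now", "see you at 5", "lunch?", "call me later"],
    "Category": ["ham", "spam", "ham", "ham", "ham"],
})

# Share of each label; a heavily skewed split means accuracy alone can mislead.
class_share = df["Category"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same `value_counts` call on `df['Category']` shows how skewed the classes are.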

The next step is normally to preprocess or clean the data, for example with spaCy or textacy. We will skip that step here.
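
If you do want a cleaning step without extra dependencies, a minimal sketch in plain Python might look like this. The `clean_text` function and its rules (lowercasing, dropping URLs, keeping only letters, digits, `$` and `!`) are assumptions for illustration, not what spaCy or textacy would do:

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop URLs, keep a small character set."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^a-z0-9$!\s]", " ", text)   # keep letters, digits, $ and !
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace


cleaned = clean_text("WIN a FREE prize!!  Visit http://example.com now")
print(cleaned)
```

A function like this can be handed to the vectorizer as `TfidfVectorizer(preprocessor=clean_text)`, so the cleaning happens inside the pipeline.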

4. Splitting the Dataset

We split our dataset into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
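
With an imbalanced dataset, a plain random split can leave the test set with a different spam ratio than the training set. One option, sketched here on made-up data (80 ham and 20 spam messages, an assumption for the example), is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, imbalanced data: 80 ham and 20 spam messages (hypothetical).
texts = [f"ham message {i}" for i in range(80)] + [f"spam message {i}" for i in range(20)]
labels = ["ham"] * 80 + ["spam"] * 20

# stratify=labels preserves the 80/20 ratio in both train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

In this article's code, that would mean adding `stratify=y` to the existing call.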

5. Scikit-Learn Pipeline

We will use a scikit-learn Pipeline, which chains a sequence of steps into a single estimator. In each pipeline, TfidfVectorizer first converts the text to numerical features, and the final step is one of our classifiers:

pipeMNB = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
pipeCNB = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ComplementNB()),
])
pipeSVC = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC()),
])
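
To make it clearer what a pipeline does, the sketch below compares a pipeline with the equivalent manual steps: fit the vectorizer on the training text, transform, fit the classifier, then transform the new text before predicting. The toy messages and labels are made up for the example:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only for illustration.
train_texts = [
    "win a free prize now",
    "free cash prize, call now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]
new_texts = ["free prize waiting", "lunch at home"]

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

# Manual version: the same steps written out by hand.
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))  # transform only, no refit

same = np.array_equal(pipe_pred, manual_pred)
print(same)
```

The pipeline also guarantees that the vectorizer is fitted on the training text only, which is easy to get wrong by hand.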

6. Building and Evaluating Models

Now we build our models with MultinomialNB, ComplementNB, and LinearSVC by fitting each pipeline on the training data. We then predict the labels for the test data and print the accuracy score, which compares the true labels with our predictions:

pipeMNB.fit(X_train, y_train)
predictMNB = pipeMNB.predict(X_test)
print(f"MNB: {accuracy_score(y_test, predictMNB):.2f}")

pipeCNB.fit(X_train, y_train)
predictCNB = pipeCNB.predict(X_test)
print(f"CNB: {accuracy_score(y_test, predictCNB):.2f}")

pipeSVC.fit(X_train, y_train)
predictSVC = pipeSVC.predict(X_test)
print(f"SVC: {accuracy_score(y_test, predictSVC):.2f}")
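
Accuracy alone can hide poor spam detection when most messages are ham. The sketch below uses hand-written labels and predictions (made up for illustration, not results from this dataset) to show how `classification_report`, which we imported earlier, exposes per-class precision and recall:

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 18 ham, 2 spam. The "model" misses one of the two spam messages.
y_true = ["ham"] * 18 + ["spam", "spam"]
y_pred = ["ham"] * 18 + ["spam", "ham"]

acc = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(f"accuracy:    {acc:.2f}")
print(f"spam recall: {spam_recall:.2f}")
print(classification_report(y_true, y_pred))
```

Here the accuracy looks high even though half of the spam slips through. With the models above, the corresponding call would be `classification_report(y_test, predictSVC)`.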

7. Results of Our Text Classification

# MNB: 0.95
# CNB: 0.98
# SVC: 0.99

As you can see, LinearSVC achieved the best accuracy score, followed by ComplementNB. We can try to improve the results by adding some parameters to our model, especially to the TfidfVectorizer. For example:

pipeMNB = Pipeline([
    # stop_words drops common English words; ngram_range=(1, 3) adds bigrams and trigrams
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])

Adding the stop_words parameter improves our MultinomialNB model but has little effect on the others. Adding ngram_range improves ComplementNB slightly but does not change the accuracy of MultinomialNB or LinearSVC. So you need to experiment with these parameters to see which work best for your dataset.
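
Rather than trying parameter combinations by hand, one option is to let `GridSearchCV` search them with cross-validation. This is a minimal sketch on a made-up toy dataset; the grid values and `cv=3` are assumptions chosen for the example, not tuned recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 spam and 6 ham messages.
texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call now to win cash", "free entry to win a prize",
    "you have won a cash reward", "call now for your free prize",
    "are we meeting for lunch today", "see you at home tonight",
    "can you send me the report", "thanks for the lunch yesterday",
    "i will call you after the meeting", "let us meet at home tomorrow",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {
    "tfidf__stop_words": [None, "english"],
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```

On the real data you would fit the search on `X_train` and `y_train` only, and keep the test set for the final comparison.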

Let’s check how our model works on a message:

message = "you have won a $10000 prize! contact us fot eh reward!"
result = pipeSVC.predict([message])
print("Result: ", result[0])

# Result: spam
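
LinearSVC does not output probabilities, but its `decision_function` gives a signed distance from the decision boundary, which can serve as a rough indication of how strongly a message leans towards one class. A small sketch on made-up training data (the messages and labels are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data, only for illustration.
train_texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call now to win cash", "you have won a cash prize",
    "are we meeting for lunch today", "see you at home tonight",
    "can you send me the report", "thanks for lunch, see you tomorrow",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
pipe.fit(train_texts, train_labels)

messages = ["you have won a free prize", "see you at lunch tomorrow"]
preds = pipe.predict(messages)
scores = pipe.decision_function(messages)  # > 0 means pipe.classes_[1]

for msg, pred, score in zip(messages, preds, scores):
    print(f"{pred:>4}  {score:+.2f}  {msg}")
```

Classes are sorted alphabetically, so with the labels "ham" and "spam" a positive score points to spam and a negative one to ham.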

And that’s it! We have successfully built and compared three different text classifiers for spam detection using Python and Scikit-Learn.

8. Video Tutorial

pythonology channel: build text classifiers
