
Disaster Tweet Classification: Identifying Real Crisis Events from Social Media

By Setareh Soltanieh

This project explores Natural Language Processing (NLP) techniques to classify tweets as either real disaster-related messages or unrelated content. It uses the "NLP Getting Started" dataset from Kaggle, which contains over 10,000 tweets (roughly 7,600 of them labeled for training), and trains a model to distinguish emergency-related messages from non-critical information.
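To make the setup concrete, here is a minimal sketch of loading the competition data with pandas. The file names train.csv and test.csv are the ones Kaggle provides for this competition, and the column names follow its data description:

```python
import pandas as pd

# Kaggle's "NLP Getting Started" competition files
train_df = pd.read_csv("train.csv")  # columns: id, keyword, location, text, target
test_df = pd.read_csv("test.csv")    # same columns, without target

print(train_df.shape)                     # roughly 7,600 labeled training tweets
print(train_df["target"].value_counts()) # 1 = real disaster, 0 = not
```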

For my first attempt, which achieved a score of 0.78057, I took a straightforward yet effective approach to classifying disaster-related tweets.

Feature Extraction: I used CountVectorizer, a simple yet powerful bag-of-words technique, to transform the raw text into numerical features. This method builds a vocabulary from the training set and represents each tweet as a vector of word counts; applying that same fitted vocabulary to the test set ensures consistency in feature representation.
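A minimal sketch of this step with scikit-learn (default CountVectorizer settings are assumed here; the actual notebook may tune tokenization or n-gram options):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training tweets only, then map the test
# tweets onto that same vocabulary so both share one feature space.
count_vectorizer = CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])

print(train_vectors.shape)  # (n_train_tweets, vocabulary_size), sparse matrix
```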

Model Selection: I trained a RidgeClassifier, a linear model that is well-suited for text classification tasks due to its ability to handle high-dimensional sparse data efficiently. Ridge regression helps prevent overfitting by applying L2 regularization, making it a robust choice for this problem.
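Fitting the classifier on those count features looks like this, again assuming scikit-learn defaults (the L2 penalty strength alpha defaults to 1.0):

```python
from sklearn.linear_model import RidgeClassifier

# L2-regularized linear classifier; alpha controls how strongly large
# weights are penalized on the high-dimensional sparse count features.
clf = RidgeClassifier()
clf.fit(train_vectors, train_df["target"])

predictions = clf.predict(test_vectors)  # 1 = disaster, 0 = not
```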

Despite its simplicity, this approach yielded a solid baseline score of 0.78057, demonstrating that even basic NLP techniques can provide competitive results in disaster tweet classification.

View my Kaggle Notebook

For my next attempt, I refined my approach and achieved an improved score of 0.80049 by making a key change in feature extraction.

Feature Extraction: Instead of CountVectorizer, I used TfidfVectorizer (Term Frequency-Inverse Document Frequency), another bag-of-words technique that weights each word by how often it appears in a tweet relative to how many tweets in the corpus contain it. This downweights common words like "is" and "are" while emphasizing more meaningful words that help distinguish disaster-related tweets.
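The change in code is small; a sketch assuming default TfidfVectorizer settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each word's count is rescaled by its inverse document frequency, so
# terms appearing in almost every tweet ("is", "are") are downweighted.
tfidf_vectorizer = TfidfVectorizer()
train_vectors = tfidf_vectorizer.fit_transform(train_df["text"])
test_vectors = tfidf_vectorizer.transform(test_df["text"])
```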

Model Selection: I continued using RidgeClassifier, as it had performed well in my previous attempt. By simply switching to TF-IDF features, the model learned a better representation of the text, leading to a noticeable improvement in performance.
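One way to sanity-check such a change before submitting is local cross-validation with the competition metric (F1). The sketch below is an assumed comparison harness, not the notebook's exact code:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

# Score both feature representations with the same classifier and the
# competition metric (F1) to see the effect of the vectorizer swap.
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    features = vectorizer.fit_transform(train_df["text"])
    scores = cross_val_score(RidgeClassifier(), features,
                             train_df["target"], cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```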

This small yet effective change boosted my score to 0.80049, demonstrating the impact of better feature engineering in NLP tasks.

View my Kaggle Notebook