Bag of Words Analysis from the Content Using Python





The bag of words (BoW) is a common technique used in natural language processing (NLP) to represent text data quantitatively. It is a simplified representation that focuses on the occurrence and frequency of words within a document or a collection of documents.

In the bag of words model, a text document is represented as an unordered collection or “bag” of words, disregarding grammar, word order, and context. It treats each word in the document as a separate and independent feature. The resulting representation captures the presence or absence of words and their frequencies within the document.
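To illustrate what discarding word order means, here is a minimal sketch using only Python's standard library. The helper name bag and the two sample sentences are made up for this illustration; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def bag(text):
    """Count lowercase, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())

a = bag("the dog bit the man")
b = bag("the man bit the dog")

# The two sentences mean different things but produce the same bag
print(a == b)     # True
print(a["the"])   # 2
```

Both sentences map to the same counts, which is exactly the information the model keeps and the information it throws away.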

Here’s a high-level overview of how the bag of words representation is created:

Tokenization: The text data is divided into individual words or tokens. Commonly, this process involves splitting the text based on whitespace, punctuation, or other delimiters.

Vocabulary Creation: A vocabulary is constructed by gathering all the unique words from the entire corpus (collection of documents).

Word Counting: For each document, the occurrence or frequency of each word in the vocabulary is counted. This information is typically represented in a matrix, where each row corresponds to a document, and each column corresponds to a word in the vocabulary. The matrix contains the word counts (or frequencies) for each document.
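The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions: the function names, the regular expression used for tokenization, and the sample sentences are hypothetical, and a library such as scikit-learn applies its own tokenization rules.

```python
import re

def tokenize(text):
    """Step 1: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    # Step 2: the vocabulary is the sorted set of unique words in the corpus
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {word: j for j, word in enumerate(vocabulary)}
    # Step 3: one row of counts per document, one column per vocabulary word
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for tok in tokens:
            row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix

documents = ["The cat sat on the mat.", "The dog sat."]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row is one document and each column is one vocabulary word, which is the same layout the Excel export later in this post produces.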

The bag of words representation simplifies text data into a numerical format that machine learning algorithms can process. It has applications in various NLP tasks, such as text classification, sentiment analysis, topic modeling, and information retrieval.

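First, copy the page content and paste it into a Word document.

Create a folder and save the document there as a .docx file (python-docx cannot read the older .doc format).

Now open a terminal (Command Prompt on Windows).

Install the required packages with pip:

pip install python-docx

pip install scikit-learn

pip install pandas openpyxl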

Here is the Python code:

import docx

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

# Load the .docx file

doc_path = "C:\\Users\\SUBRATA\\Desktop\\bagofword\\doc.docx"

doc = docx.Document(doc_path)

# Extract the text from each paragraph in the document

corpus = []

for paragraph in doc.paragraphs:
    corpus.append(paragraph.text)

# Initialize CountVectorizer

vectorizer = CountVectorizer()

# Fit the vectorizer on the corpus and transform it into a bag of words representation

bag_of_words = vectorizer.fit_transform(corpus)

# Get the feature names (words); get_feature_names() was removed in scikit-learn 1.2

feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame from the bag of words

df = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)

# Export the DataFrame to Excel

output_path = "C:\\Users\\SUBRATA\\Desktop\\bagofword\\doc.xlsx"

df.to_excel(output_path, index=False)

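Before running the code, make sure the output folder exists and that doc.xlsx is not open in Excel. pandas creates the Excel file itself and overwrites any existing file with the same name.

Now run the code.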


By following these steps, you can extract the text from a .docx file, build a bag of words representation, and export it to an Excel file for further analysis.

In the Excel file, add a SUM formula at the bottom of each column to get the total number of occurrences of each word (query) across the document.
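The column totals can also be computed in pandas before exporting. The following is a minimal sketch, assuming a small hypothetical DataFrame in place of the real bag of words table; the column names and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame built from the bag of words
df = pd.DataFrame(
    [[1, 0, 2], [0, 1, 1], [3, 0, 0]],
    columns=["python", "excel", "word"],
)

# Sum each column to get the total count of each word across all paragraphs,
# then sort so the most frequent words come first
totals = df.sum(axis=0).sort_values(ascending=False)
print(totals)
```

Sorting the totals makes the most frequent words easy to spot without scrolling through the spreadsheet.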

Note that while the bag of words representation is useful for capturing word frequency information, it does not consider the order of words or capture semantic relationships between them. Other techniques, such as n-grams, word embeddings, or contextual models like Transformers, address these limitations by incorporating more contextual information.
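As a rough illustration of how n-grams recover some word order, the sketch below compares single-word counts with two-word (bigram) counts using the ngram_range parameter of CountVectorizer. The two sample sentences are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog bit the man", "the man bit the dog"]

unigrams = CountVectorizer(ngram_range=(1, 1))  # single words only
bigrams = CountVectorizer(ngram_range=(2, 2))   # pairs of adjacent words only

uni_matrix = unigrams.fit_transform(docs).toarray()
bi_matrix = bigrams.fit_transform(docs).toarray()

# With single words, both sentences get identical rows
print((uni_matrix[0] == uni_matrix[1]).all())  # True

# With word pairs, the rows differ because "dog bit" and "man bit" are distinct
print((bi_matrix[0] == bi_matrix[1]).all())    # False
print(sorted(bigrams.vocabulary_))
```

The trade-off is a larger vocabulary: every distinct pair of adjacent words becomes its own column.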