
Bag of Words – Count Vectorizer

In this blog post we will look at the bag of words model and walk through its implementation in detail.

Introduction (Bag of Words)

This is one of the most basic and simple methods for converting words to vectors. The idea is simple: suppose our corpus contains n distinct words. For each word we create a vector of size n, put a 1 at the position corresponding to that word, and set all other values to 0. This is also called one-hot encoding.

To explain this further, let's suppose our corpus contains the words "NLP", "is", and "awesome". Converting them with this model gives something like:

"NLP" => [1,0,0]
"is" => [0,1,0]
"awesome" => [0,0,1]

So we convert the words to vectors using simple one hot encoding.
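
To make this concrete, here is a minimal plain-Python sketch (my own illustration, not part of the post's code) of how such one-hot vectors can be built from a small vocabulary:

# build a one-hot vector for each word in a toy vocabulary
vocabulary = ["NLP", "is", "awesome"]

def one_hot(word, vocabulary):
    # vector of zeros with a single 1 at the word's index
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

for word in vocabulary:
    print(word, "=>", one_hot(word, vocabulary))
# NLP => [1, 0, 0]
# is => [0, 1, 0]
# awesome => [0, 0, 1]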

Of course, this is a very simple model and it has a lot of problems:

  • If our vocabulary is very large, this creates very large word vectors that are sparse (all values are 0 except one). This is not very efficient.
  • We lose any semantic information about the words, their relevance, and so on.

Let's see how this looks in practice.

Basic Implementation

Let’s first do the basic imports

import spacy 
nlp = spacy.load("en_core_web_sm")  # loaded here but not used in the snippets below

from sklearn.feature_extraction.text import CountVectorizer

import matplotlib.pyplot as plt

scikit-learn provides a class called CountVectorizer that does exactly this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

sentence = "NLP is awesome"
count_vectorizer = CountVectorizer()
# fit on the individual words, so each word is treated as its own "document"
count_vectorizer.fit(sentence.split())
matrix = count_vectorizer.transform(sentence.split())
print(matrix.todense())


#output
[[0 0 1]
 [0 1 0]
 [1 0 0]]

So this is the vector representation of our words. Note that CountVectorizer lowercases tokens and builds its vocabulary in alphabetical order, so the columns here correspond to "awesome", "is", "nlp", which is why the rows above are not in the order of the original sentence.
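
We can check the column order by asking the vectorizer for its feature names (assuming scikit-learn 1.0 or newer; older versions use get_feature_names() instead):

# column order of the matrix above (alphabetical, lowercased)
print(count_vectorizer.get_feature_names_out())
# ['awesome' 'is' 'nlp']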

Let’s see another example

# bag of words very simple example 

sentences = ["NLP is awesome","I want to learn NLP"]
count_vectorizer = CountVectorizer()  
count_vectorizer.fit(sentences)

new_sentense = "How to learn NLP?"

matrix = count_vectorizer.transform(new_sentense.split())
print(matrix.todense())

#output
[[0 0 0 0 0 0]
 [0 0 0 0 1 0]
 [0 0 1 0 0 0]
 [0 0 0 1 0 0]]

As we can see, the first word "How" is not present in the vocabulary learned from our two training sentences, so its row is all zeros. Transforming word by word like this is mainly for illustration; normally we would transform the whole sentence at once, as shown below.
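
For completeness, here is a small sketch of the more typical usage: passing the sentence as a single document gives one count vector for the whole sentence, and the unknown word "How" is simply ignored.

# transform the whole sentence as one document
matrix = count_vectorizer.transform([new_sentense])
print(matrix.todense())
# [[0 0 1 1 1 0]]  -> counts for "learn", "nlp" and "to"; "How" is not in the vocabulary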

More advanced usage

In this section we use a dataset that ships with scikit-learn.


import nltk
nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize

import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups

# load the training split of the 20 newsgroups dataset
news = fetch_20newsgroups(subset="train")

print(news.keys())

df = pd.DataFrame(news['data'])
print(df.head())


count_vectorizer = CountVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=1000, max_df=.9)  
count_vectorizer.fit(news["data"])

# matrix = count_vectorizer.transform(new_sentense.split())
# print(matrix.todense())
print(count_vectorizer.get_feature_names_out())  # older scikit-learn versions use get_feature_names() instead
print(count_vectorizer.vocabulary_)
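
As a quick follow-up (my addition, not shown in the original notebook), we can transform the whole corpus and look at the resulting document-term matrix, which comes back as a sparse matrix with one row per document and one column per vocabulary term:

# transform all training documents into a sparse document-term matrix
matrix = count_vectorizer.transform(news["data"])
print(matrix.shape)   # (number of documents, 1000) because max_features=1000
print(matrix[0])      # non-zero counts for the first document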

There are a few important parameters to know when using CountVectorizer:

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float, the parameter represents a proportion of documents; if an int, absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an int, absolute counts. This parameter is ignored if vocabulary is not None.

max_features : int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
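
A small toy sketch (the three-sentence corpus below is my own example) of how these parameters change the learned vocabulary:

# toy corpus: "NLP" appears in all 3 documents, every other term in just 1
docs = ["NLP is awesome", "I want to learn NLP", "NLP NLP NLP"]

# drop terms appearing in more than 2 documents (max_df), keep terms appearing
# in at least 1 document (min_df), and cap the vocabulary size (max_features)
vec = CountVectorizer(max_df=2, min_df=1, max_features=5)
vec.fit(docs)
print(vec.get_feature_names_out())
# 'nlp' is dropped because its document frequency (3) exceeds max_df=2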

The full source code can be seen here: https://colab.research.google.com/drive/1oiJ-kXc_Vdt46xOwLSkLqPBSqGTtGPAl
