
Project 3: Finding Customer Concern Areas using NLP

Shashank Shanu


Table of Contents
  • Problem Statement
  • Dataset description
  • Visualization
  • Creating wordcloud
  • Data Preprocessing
            1. Cleaning the Data
            2. Plotting frequency distribution
            3. Checking stopwords and punctuations
            4. Applying lemmatization
  • Plot a bigram bar graph on top 25 words.
  • Converting to Document term matrix
  • Conclusion

Problem Statement

          To find the areas of highest customer concern from the given Amazon reviews dataset, using NLP techniques.

Dataset description

  • The dataset is scraped from Amazon's website; the summary column contains the reviews given by customers on the product.
  • Note: It's a small dataset, used here only to show how we can find customer concern areas using NLP techniques.

Importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Importing data and displaying it

amazon_reviews = pd.read_csv('amazon_reviews.csv')
amazon_reviews.head()
Output:

Checking shape of the dataset

amazon_reviews.shape
Output:
(999, 10)
  • We can see that our dataset consists of 999 observations and 10 columns.

Creating dataset

  • As we can see, we only need the review text and the overall rating given by the customers, so we will create a new dataframe containing only these two columns.
reviews = pd.DataFrame()
reviews['reviewText'] = amazon_reviews['reviewText']
reviews['overall'] = amazon_reviews['overall']
reviews.head()
Output:

Checking missing values

reviews.isnull().sum()
Output:
reviewText    2
overall       0
dtype: int64

Visualization

Let's plot a heatmap to check for the null values
sns.heatmap(reviews.isnull())
Output:
  • We can observe that there are only 2 missing values in our dataset. Since this number is so small, we can simply drop them without much effect on our results.
reviews = reviews[reviews['reviewText'].notnull()]
reviews.head()
reviews.shape
(997, 2)
  • Note: Here we remove the missing values, leaving a dataset with no null values, which is why the shape changes as shown above.

Combining all the reviews into a single string so that we can apply NLP techniques

combined_reviews = " ".join(reviews['reviewText'])
print(combined_reviews[:1200])
type(combined_reviews)
Output:
str

Creating wordcloud

What is a word cloud?

  • Text data has grown exponentially in recent years resulting in an ever-increasing need to analyze the massive amounts of such data. Word Cloud provides an excellent option to analyze the text data through visualization in the form of tags, or words, where the importance of a word is explained by its frequency.
from wordcloud import WordCloud
word_cloud = WordCloud(width = 1000, height = 600, background_color = 'white', max_words = 150).generate(combined_reviews)
plt.figure(figsize = (12,6))
plt.imshow(word_cloud)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

Find frequency of each word

# Split the combined reviews into individual words
list1 = []
for i in combined_reviews.split():
    list1.append(i)
# Count how many times each word occurs
dictionary1 = {}
for j in list1:
    dictionary1[j] = dictionary1.get(j, 0) + 1
# Convert the counts into a dataframe with 'Words' and 'Frequency' columns
series1 = pd.Series(dictionary1)
word_freq = pd.DataFrame(series1)
word_freq = word_freq.reset_index().rename(columns = {'index':'Words', 0:'Frequency'})
word_freq.head()
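As a side note, the same word counts can be produced more concisely with Python's built-in collections.Counter; a minimal alternative sketch, assuming the same combined_reviews string from above:
from collections import Counter
# Count the words and convert the result straight into a 'Words' / 'Frequency' dataframe
word_counts = Counter(combined_reviews.split())
word_freq_alt = pd.Series(word_counts).rename_axis('Words').reset_index(name = 'Frequency')
word_freq_alt.head()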

Checking the top 25 most frequent words

top_25_words = word_freq.sort_values(ascending = False, by = 'Frequency')
top_25_words.head(25)

Checking the 25 least frequent words

last_25_words = word_freq.sort_values(ascending = False, by = 'Frequency')
last_25_words.tail(25)

Data Preprocessing

1. Cleaning the Data

Applying Tokenization

What is Tokenization?
  • Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
  • For example, consider the sentence: “Never give up”.
  • The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
  • Similarly, tokens can be either characters or subwords. For example, let us consider “smarter”
  • Character tokens: s-m-a-r-t-e-r.
  • Subword tokens: smart-er
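To make this concrete, here is a minimal sketch showing the word and character tokens for the examples above (word_tokenize needs NLTK's 'punkt' models, downloadable with nltk.download('punkt')):
from nltk.tokenize import word_tokenize
# Word tokens of the example sentence
print(word_tokenize("Never give up"))   # ['Never', 'give', 'up']
# Character tokens of "smarter"
print(list("smarter"))                  # ['s', 'm', 'a', 'r', 't', 'e', 'r']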
from nltk.tokenize import word_tokenize
all_words = word_tokenize(combined_reviews.lower())
print(all_words[:200])
Output:
from nltk.probability import FreqDist
fdist = FreqDist(all_words)
fdist
Output:
FreqDist({'the': 7854, '.': 6677, ',': 5824, 'i': 4497, 'to': 4419, 'and': 4210, 'a': 3900, 'it': 3656, 'is': 2497, 'for': 2097, ...})

2. Plotting frequency distribution

What is a frequency distribution?

  • A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping each sample to the number of times that sample occurred as an outcome.
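As a quick toy example of how nltk's FreqDist counts tokens (the short token list below is purely for illustration):
from nltk.probability import FreqDist
toy_fdist = FreqDist(['the', 'cat', 'sat', 'on', 'the', 'mat'])
print(toy_fdist['the'])          # 2, because 'the' occurred twice
print(toy_fdist.most_common(1))  # [('the', 2)]
Now let's plot the distribution of the most frequent tokens in our reviews.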
plt.figure(figsize = (10,6))
fdist.plot(25, cumulative = False)
plt.show()
Output:

3. Checking stopwords and punctuations

What are Stopwords?

  • A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
  • We would not want these words to take up space in our database or valuable processing time, so we can remove them easily by keeping a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stopwords stored for 16 different languages.
from nltk.corpus import stopwords
from string import punctuation
stop_words = stopwords.words('english')
print(stop_words)
print(list(punctuation))
Output:
stop_words_updated = stop_words + ['..', '...', 'will', 'would', 'can', 'could', "n't"]
# Remove stopwords, punctuation and very short tokens from the tokenized reviews
all_words_updated = [i for i in all_words if i not in stop_words_updated
                     and i not in list(punctuation) and len(i) > 2]
print(all_words_updated[:50])
Output:
['got', 'gps', 'husband', 'otr', 'road', 'trucker', 'impressed', 'shipping', 'time', 'arrived', 
'days', 'earlier', 'expected', 'within', 'week', 'use', 'however', 'started', 'freezing', 
'glitch', 'unit', 'worked', 'great', 'worked', 'work', 'great', 'normal', 'person', 'well', 
'trucker', 'option', 'big', 'truck', 'routes', 'tells', 'scale', 'coming', 'ect', 'love', 
'bigger', 'screen', 'ease', 'use', 'ease', 'putting', 'addresses', 'memory', 'nothing', 'really', 'bad']

4. Applying lemmatization

What is Lemmatization?

  • Lemmatization is the process of arriving at a lemma of a word.

What is Lemma?

  • Lemma is the root from which a word is formed. For example, given the word went, the lemma would be 'go' since went is the past form of go.
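As a quick illustration (assuming NLTK's WordNet data has been downloaded with nltk.download('wordnet')), WordNetLemmatizer recovers exactly this lemma when it is told the part of speech; the variable name wnl below is just for this example:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('went', pos = 'v'))  # 'go' (treated as a verb)
print(wnl.lemmatize('went'))             # 'went' (default POS is noun, so the word is left unchanged)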

Why do Lemmatization?

  • NLP comprises varied tasks, and at the core of these tasks is dealing with a language corpus.
  • Collect Data → Clean (remove unnecessary parts of it) → Train, Test, Evaluate
  • When we begin an NLP task, we first decide our goal and the language that we want to deal with. Say you want to build a Speech to Text system for English. The first step in that direction would be to build training data (which, in the given case, would be a large corpus of millions of English sentences).
  • In building this training data, we clean it, and this is where lemmatization gains importance. It becomes important when you wish to clean the data of all words of a given root. Instead of manually going to each word and checking it, the system searches for all words which have been lemmatized from a given root and then removes them all at once.
  • Additionally, lemmatization helps in information retrieval systems.
  • We can summarize it as: NLP tasks require dealing with language corpora. Before training a system with data, the data (for the given task and language) needs to be cleaned. Lemmatization is important at this cleaning stage: it converts all words to their roots, or lemmas. A closely related process is stemming; a stemmer takes in a word and gives back its stem (see the comparison sketch below).
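A minimal sketch of that difference, comparing NLTK's PorterStemmer with the WordNetLemmatizer on the same word:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('studies'))          # 'studi' (a stem need not be a valid dictionary word)
print(lemmatizer.lemmatize('studies'))  # 'study' (a lemma always is)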
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
lemma_words = [lemma.lemmatize(i) for i in all_words_updated]
print(len(set(lemma_words)))
Output:
9038

Create new dataframe after Text Cleaning

def clean_text(text):
    # Tokenize, drop stopwords / punctuation / very short tokens, lemmatize, and rejoin
    token = word_tokenize(text.lower())
    lemm = [lemma.lemmatize(i) for i in token if i not in stop_words_updated
            and i not in list(punctuation) and len(i) > 2]
    sentence = ' '.join(lemm)
    return sentence

reviews['clean_reviewText'] = reviews['reviewText'].apply(clean_text)
reviews.head()
Output:
Create new dataframe after Text Cleaning | insideAIML

Plot a bigram bar graph on top 25 words

What are n-Grams?

  • N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios).
  • For example, for the sentence “The cow jumps over the moon”, if N=2 (known as bigrams), then the n-grams would be:
  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram.
If N=3, the n-grams would be:
  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3 this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so on.
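Before applying this to our cleaned reviews, here is a minimal sketch of bigram extraction on the example sentence above using scikit-learn's CountVectorizer (get_feature_names_out() requires scikit-learn 1.0 or newer; older versions use get_feature_names()):
from sklearn.feature_extraction.text import CountVectorizer
toy_vect = CountVectorizer(ngram_range = (2, 2))
toy_vect.fit(["The cow jumps over the moon"])
print(toy_vect.get_feature_names_out())
# ['cow jumps' 'jumps over' 'over the' 'the cow' 'the moon']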

Creating Bigrams

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(ngram_range = (2,2))
bigrams = count_vect.fit_transform(reviews['clean_reviewText'])
# Note: get_feature_names() was removed in scikit-learn 1.2+; use get_feature_names_out() there
print(count_vect.get_feature_names()[:100])
Output:

Converting to Document term matrix

What is Document term matrix?

  • A DTM is a table that describes the frequency of terms that occur in a collection of documents. Typically, DTMs are sparse matrices.
  • It is a mathematical matrix in which rows correspond to documents in the collection and columns correspond to terms, as illustrated in the small example below.
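To illustrate that layout before building the real DTM from our reviews, here is a minimal sketch using two short hypothetical documents:
from sklearn.feature_extraction.text import CountVectorizer
toy_docs = ["battery died fast", "screen cracked fast"]   # two hypothetical documents
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)
# Rows correspond to the two documents, columns to the terms
toy_DTM = pd.DataFrame(toy_counts.toarray(), columns = toy_vect.get_feature_names_out())
print(toy_DTM)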
DTM = pd.DataFrame(bigrams.toarray(), columns = count_vect.get_feature_names())
DTM.head()
Output:
top_25_bigrams = DTM.sum().sort_values(ascending = False).head(25)
top_25_bigrams
Output:
top_25_bigrams.plot(kind = 'bar', figsize = (16,8))
plt.show()
Output:
Find customer concern areas: the bigrams that contain any of the following negative words indicate customer concern areas, and we will look at the top 25 of them: 'poor', 'waste', 'bad', 'defective', 'disgusting', 'untrusty', 'worst', 'horrible', 'unexpectedly', 'slow'.
bigrams = DTM.columns
print(bigrams)
Output:
Index(['00 100', '00 300', '00 believe', '00 best', '00 came', '00 dollar',
       '00 free', '00 great', '00 hit', '00 hold',
       ...
       'zoom touch', 'zoom type', 'zoom update', 'zoom use', 'zooming awkward',
       'zooming feature', 'zooming scrolling', 'zs15 photo', 'zune place',
       'zune ve'],
      dtype='object', length=55110)
negative_words = ['poor', 'waste', 'bad', 'defective', 
                  'disgusting', 'untrusty', 'worst', 
                  'horrible', 'unexpectedly', 'slow']

negative_bigrams = []
for i in bigrams:
    words = i.split()
    # Keep the bigram if at least one of its words is in the negative-word list
    if sum(np.in1d(words, negative_words)) >= 1:
        negative_bigrams.append(i)
DTM_subset = DTM[negative_bigrams]
top_25_customer_concern_areas = DTM_subset.sum().sort_values(ascending = False).head(25)
top_25_customer_concern_areas
Output:

Plot a bar graph for the top 25 customer concern areas.

top_25_customer_concern_areas.plot(kind = 'bar', figsize = (16,8))
Output:
  • From the graph above, we can observe the top 25 concern areas of the customers.

Conclusion

  • After applying NLP techniques such as text cleaning, tokenization, and lemmatization, we were able to find the top concern areas of the customers.
  • Using these concern areas, the organization or company can focus its improvements where they matter most, which increases the customer satisfaction rate and also helps to grow the company's sales, revenue, and brand in the market.
I hope you enjoyed this project and learned how NLP techniques can be used and implemented to find customer concern areas.
    
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at https://insideaiml.com/home.
Thanks for reading…
Happy Learning…
