Python - Text Classification New Techniques

Ajinkya Gandhi

a year ago

Text Classification in Python | Insideaiml
We often need to sort text into categories according to predefined criteria. NLTK ships several corpora whose documents are already labeled by category. In the example below, we look at the movie review corpus and check which categories it defines.

# Let's see how the movie reviews are categorized
# (run nltk.download('movie_reviews') once if the corpus is not installed)
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
    all_cats.append(w.lower())
print(all_cats)
When we run the above program, we get the following output:
['neg', 'pos']

Now let's look at the content of one of the files with a positive review. We split the file into sentences and print the first four as a sample.

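from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

# sent_tokenize needs NLTK's Punkt sentence tokenizer data to be installed
sample = movie_reviews.raw("pos/cv944_13521.txt")

# Split the raw review text into sentences and print the first four
sentences = sent_tokenize(sample)
for sentence in sentences[:4]:
    print(sentence)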
When we run the above program, we get the following output:

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .
Next, we take the tokenized words from every file in the corpus and find the most common ones using NLTK's FreqDist class.

import nltk
from nltk.corpus import movie_reviews

# Lowercase every token so that counts are case-insensitive
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# Build a frequency distribution and show the ten most common tokens
word_freq = nltk.FreqDist(all_words)
print(word_freq.most_common(10))
When we run the above program, we get the following output:

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
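The ten most common tokens above are all punctuation marks and function words, which say little about whether a review is positive or negative. A common next step is to filter them out before counting. The sketch below is one way to do that; the `STOPWORDS` set and the `content_word_freq` helper are illustrative assumptions (NLTK also provides a fuller stopword list in `nltk.corpus.stopwords`, which needs a separate download), and the token list is made up so that the example runs without any corpus.

```python
import string
import nltk

# Small illustrative stopword list (an assumption, not NLTK's own list)
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "it", "that", "this"}

def content_word_freq(tokens):
    """Count lowercased tokens, skipping stopwords and pure punctuation."""
    kept = []
    for tok in tokens:
        word = tok.lower()
        if word in STOPWORDS:
            continue
        # Skip tokens made up only of punctuation characters, such as "," or "!"
        if all(ch in string.punctuation for ch in word):
            continue
        kept.append(word)
    return nltk.FreqDist(kept)

# Made-up tokens standing in for movie_reviews.words()
sample_tokens = ["The", "plot", "is", "thin", ",", "but", "the", "cast",
                 "is", "great", ".", "Great", "cast", "!"]
freq = content_word_freq(sample_tokens)
print(freq.most_common(2))
```

On the real corpus, the same helper could be called as `content_word_freq(movie_reviews.words())`.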
This is how NLTK supports the groundwork for text classification in Python: labeled corpora, tokenization, and word frequency counts. This was only a brief introduction to using Python with the NLTK library. Later we will discuss its features in more detail.
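As a pointer toward the classification step itself, here is a minimal sketch using NLTK's `NaiveBayesClassifier`. This article does not cover classifiers, so treat it as one possible approach rather than the method used here: the four training sentences, the `vocabulary` list, and the `document_features` helper are all made up for illustration, and a real experiment would build features from the movie review corpus instead.

```python
import nltk

# Tiny hand-written training set (illustrative only, not taken from the corpus)
train_docs = [
    ("a wonderful and moving film".split(), "pos"),
    ("great acting and a wonderful script".split(), "pos"),
    ("a dull and boring plot".split(), "neg"),
    ("boring characters and terrible acting".split(), "neg"),
]

# Every distinct word seen in training becomes one feature
vocabulary = sorted({w for words, _ in train_docs for w in words})

def document_features(words):
    """Map a token list to {'contains(word)': True/False} over the vocabulary."""
    present = set(words)
    return {f"contains({w})": (w in present) for w in vocabulary}

# Train a Naive Bayes classifier on (feature dict, label) pairs
train_set = [(document_features(words), label) for words, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify two unseen token lists
positive_guess = classifier.classify(document_features("a wonderful film".split()))
negative_guess = classifier.classify(document_features("boring and terrible".split()))
print(positive_guess, negative_guess)
```

With the real corpus, the labels would come from `movie_reviews.categories()` and the vocabulary would typically be limited to the most frequent words in the frequency distribution.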
I hope you enjoyed this article and now have a first look at text classification in Python.
For more such blogs/courses on data science, machine learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.
Thanks for reading…
Happy Learning…
