Python - Text Classification New Techniques

Ajinkya Gandhi

4 years ago

We often need to sort text into categories according to predefined criteria. NLTK ships several corpora that are already categorized in this way. In the example below, we load the movie review corpus and list the categories it defines.

# List the categories used to label the movie reviews.
# Requires the corpus data; run nltk.download("movie_reviews") once if it is missing.
from nltk.corpus import movie_reviews

all_cats = [category.lower() for category in movie_reviews.categories()]
print(all_cats)
When we run the above program, we get the following output:
['neg', 'pos']
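The file ID used later in this article, pos/cv944_13521.txt, suggests that each review carries its category as a directory prefix. As a rough sketch of that mapping, the snippet below groups file IDs by prefix in plain Python, without NLTK. Only the first file ID comes from this article; the other two, and the helper name group_by_category, are made up for illustration.

```python
from collections import defaultdict

# File IDs in the "<category>/<name>.txt" style.
# Only the first one appears in this article; the other two are hypothetical.
file_ids = [
    "pos/cv944_13521.txt",
    "pos/example_a.txt",
    "neg/example_b.txt",
]


def group_by_category(ids):
    """Group file IDs by the directory prefix before the first slash."""
    groups = defaultdict(list)
    for file_id in ids:
        category, _, _ = file_id.partition("/")
        groups[category].append(file_id)
    return dict(groups)


groups = group_by_category(file_ids)
print(sorted(groups))
print(len(groups["pos"]))
```

With the real corpus, movie_reviews.fileids(categories="pos") should return the file IDs of one category directly, so this helper is only a way to see the idea.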

Now let's look at the content of one of the files with a positive review. We split the raw text of the file into sentences and print the first four as a sample.

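from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize

# Read the raw text of one positive review.
sample = movie_reviews.raw("pos/cv944_13521.txt")

# Split the text into sentences and print the first four.
# sent_tokenize needs the Punkt models; run nltk.download("punkt") once if they are missing.
sentences = sent_tokenize(sample)
for sentence in sentences[:4]:
    print(sentence)
When we run the above program, we get the following output: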

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .
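To make the idea of sentence splitting concrete, here is a deliberately naive stand-in written in plain Python. It is not how sent_tokenize works internally (NLTK uses a trained Punkt model); it simply splits after '.', '!' or '?' when whitespace follows. The function name naive_sentences is made up for this sketch, and the input is the first two sentences of the output above.

```python
import re

# The first two sentences of the sample output, joined into one string.
text = (
    "meteor threat set to blow away all volcanoes & twisters ! "
    "summer is here again !"
)


def naive_sentences(raw):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", raw.strip())
    return [part for part in parts if part]


sentences = naive_sentences(text)
for sentence in sentences:
    print(sentence)
```

A rule this simple would break on abbreviations such as "Mr." or on decimal numbers, which is one reason to prefer a trained tokenizer for real text.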
Next, we collect every word token in the corpus, convert it to lowercase, and find the most common words by using the FreqDist class from nltk.

import nltk
from nltk.corpus import movie_reviews

# Lowercase every token in the corpus so that "The" and "the" count as one word.
all_words = [w.lower() for w in movie_reviews.words()]

# Count the occurrences of each token and print the ten most frequent ones.
word_freq = nltk.FreqDist(all_words)
print(word_freq.most_common(10))
When we run the above program, we get the following output:

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), 
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]
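The ten most frequent tokens above are all punctuation marks and function words, which say little about whether a review is positive or negative. A common next step is to filter them out before counting. The sketch below shows the idea with collections.Counter, which FreqDist is built on, so the same most_common call applies. The token list is adapted from the sample sentences printed earlier, and the tiny stop list and the function name content_word_counts are assumptions made for illustration.

```python
from collections import Counter

# A few tokens adapted from the sample sentences printed earlier.
tokens = (
    "summer is here again ! leading the pack this summer is "
    "deep impact , one of the first few film releases ."
).split()

# A tiny illustrative stop list; a real one would be much longer.
stop_words = {"is", "the", "of", "this", "one"}


def content_word_counts(words, stops):
    """Count lowercase alphabetic tokens that are not in the stop list."""
    return Counter(
        w.lower() for w in words
        if w.isalpha() and w.lower() not in stops
    )


counts = content_word_counts(tokens, stop_words)
print(counts.most_common(1))
```

On the full corpus, the same filter could be applied to movie_reviews.words() before the tokens are passed to nltk.FreqDist.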
That is how NLTK's categorized corpora and frequency tools give us a starting point for text classification in Python. This is only a brief introduction to using Python with the NLTK library. We will discuss more of its features later.
I hope you enjoyed reading this article and now have a first look at text classification in Python.
For more blogs and courses on data science, machine learning, artificial intelligence, and emerging technologies, visit us at InsideAIML.
Thanks for reading…
Happy Learning…
