10 Examples Of Why Python NLTK Is Useful

As the world moves steadily towards AI, natural language processing keeps improving. In this article we will go over the Python Natural Language Toolkit (NLTK) and ten ways it can be useful.

  1. Tokenization: A common task in natural language processing is to break a long string or corpus of text into smaller units, such as words or sentences. NLTK provides different tokenizers that can help achieve this. For example:

   import nltk

   # One-time setup: nltk.download("punkt") (newer NLTK releases may ask for "punkt_tab")
   sentence = "The quick brown fox jumps over the lazy dog."

   # Tokenize the sentence into words
   words = nltk.word_tokenize(sentence)
   print(words)

   # Split the text into sentences (there is only one here)
   sentences = nltk.sent_tokenize(sentence)
   print(sentences)

Output:

   ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
   ['The quick brown fox jumps over the lazy dog.']
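As a rough sketch of how the choice of tokenizer changes the result, the snippet below runs three of NLTK's rule-based tokenizers, which need no downloaded models, on a short contraction. The sample text is made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer, TreebankWordTokenizer, wordpunct_tokenize

text = "Don't stop."

# Splits into runs of word characters and runs of punctuation
punct_tokens = wordpunct_tokenize(text)

# Keeps only runs of word characters, so punctuation disappears
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(text)

# Penn Treebank conventions: the contraction is split into "Do" + "n't"
treebank_tokens = TreebankWordTokenizer().tokenize(text)

print(punct_tokens)      # ['Don', "'", 't', 'stop', '.']
print(word_only_tokens)  # ['Don', 't', 'stop']
print(treebank_tokens)   # ['Do', "n't", 'stop', '.']
```

Which one fits best depends on the task: the Treebank style keeps contractions meaningful, while the word-only pattern is handy when punctuation is just noise.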

  2. Part-of-speech tagging: NLTK can automatically tag each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. This can be useful for various applications, such as text classification, named entity recognition, and sentiment analysis. For example:

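   import nltk

   # One-time setup: the tagger model must be downloaded with nltk.download();
   # the exact resource name depends on the NLTK release
   sentence = "The quick brown fox jumps over the lazy dog."

   # Tokenize the sentence into words
   words = nltk.word_tokenize(sentence)

   # Tag each word with its part of speech
   tagged_words = nltk.pos_tag(words)
   print(tagged_words)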

Output:

   [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
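To illustrate how tagged output might feed a downstream step, here is a minimal sketch that groups the words from the output above by tag. The tagged list is copied from that output so the snippet runs without the tagger model; the variable names are only illustrative.

```python
from collections import defaultdict

# (word, tag) pairs copied from the tagger output shown above
tagged_words = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

# Collect the words seen under each tag
words_by_tag = defaultdict(list)
for word, tag in tagged_words:
    words_by_tag[tag].append(word)

# Penn Treebank noun tags all start with "NN", adjective tags with "JJ"
nouns = [word for word, tag in tagged_words if tag.startswith("NN")]
adjectives = [word for word, tag in tagged_words if tag.startswith("JJ")]

print(nouns)       # ['brown', 'fox', 'dog']
print(adjectives)  # ['quick', 'lazy']
```

Note that the tagger labelled "brown" as a noun here, a reminder that automatic tags are predictions and are worth spot-checking.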

For a more thorough, step-by-step guide that goes from tagging to detecting entities such as names, see the following article, which I used in one of my past projects and which breaks the process down in detail:

How To Extract Human Names Using Python NLTK

  3. Stemming and lemmatization: NLTK can reduce words to their base form, such as “running” to “run” or “mice” to “mouse”, through stemming and lemmatization respectively. This can help with reducing the dimensionality of text data and improving the performance of machine learning models that rely on text features. For example:

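   import nltk

   # One-time setup for the lemmatizer: nltk.download("wordnet")
   stemmer = nltk.stem.PorterStemmer()
   lemmatizer = nltk.stem.WordNetLemmatizer()

   word = "running"
   stem_word = stemmer.stem(word)
   # Without a part-of-speech hint the lemmatizer treats the word as a noun;
   # lemmatizer.lemmatize(word, pos="v") returns "run" instead
   lemma_word = lemmatizer.lemmatize(word)

   print(stem_word)
   print(lemma_word)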

Output:

   run
   running
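The dimensionality point can be sketched with the Porter stemmer alone, which needs no downloaded data. The word list below is a made-up example of one word family; the vocabulary shrinks because every variant maps to the same stem.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample: five surface forms of the same word family
words = ["connect", "connected", "connecting", "connection", "connections"]

stems = [stemmer.stem(word) for word in words]

vocabulary_before = set(words)
vocabulary_after = set(stems)

print(stems)
print(len(vocabulary_before), "->", len(vocabulary_after))  # 5 -> 1
```

A model that uses words as features would see one feature here instead of five, at the cost of losing the distinction between the forms.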

  4. Stop word removal: NLTK comes with a list of stop words, such as “the”, “and”, and “but”, that are usually not informative for text analysis and can be removed from a document. This can help reduce noise and improve the accuracy of text classification or information retrieval techniques. For example:

   import nltk

   # One-time setup: nltk.download("stopwords")
   sentence = "The quick brown fox jumps over the lazy dog."

   # Tokenize the sentence into words
   words = nltk.word_tokenize(sentence)

   # Remove stop words (case-insensitively)
   stop_words = set(nltk.corpus.stopwords.words("english"))
   filtered_words = [word for word in words if word.casefold() not in stop_words]
   print(filtered_words)

Output:

   ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']
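The same filtering pattern works with any set of words, which can help when a domain needs its own stop list. The sketch below assumes a tiny hand-picked set (a stand-in, not NLTK's actual list) and also drops punctuation tokens, which the example above leaves in.

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for nltk.corpus.stopwords.words("english")
STOP_WORDS = {"the", "a", "an", "and", "but", "or", "over", "of", "to", "in"}

def content_words(text, stop_words=STOP_WORDS):
    """Return alphabetic tokens that are not in the stop word set."""
    tokens = wordpunct_tokenize(text)
    return [t for t in tokens if t.isalpha() and t.casefold() not in stop_words]

filtered = content_words("The quick brown fox jumps over the lazy dog.")
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```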

  5. Named entity recognition: NLTK can identify named entities in text, such as person names, locations, organizations, and dates. This can be useful for extracting structured information from unstructured text data, such as news articles or social media posts. For example:

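   import nltk

   # One-time setup: the tokenizer, tagger, and NE chunker data must be
   # downloaded with nltk.download(); resource names depend on the NLTK release
   sentence = "Barack Obama was born in Hawaii on August 4, 1961."

   # Tokenize the sentence into words
   words = nltk.word_tokenize(sentence)

   # Tag parts of speech
   tagged_words = nltk.pos_tag(words)

   # Recognize named entities; the result is an nltk.Tree
   named_entities = nltk.ne_chunk(tagged_words)
   print(named_entities)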

Output:

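   (S
     (PERSON Barack/NNP)
     (PERSON Obama/NNP)
     was/VBD
     born/VBN
     in/IN
     (GPE Hawaii/NNP)
     on/IN
     (DATE August/NNP 4/CD ,/, 1961/CD ./.))

The exact tree depends on the NLTK version and the chunker model it ships with, so the labels and grouping you see may differ.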
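To turn such a tree into structured data, one option is to walk its labelled subtrees. The sketch below builds a small tree by hand, in the same shape that ne_chunk returns (a subset of the tree above), so it runs without the chunker model; the variable names are illustrative.

```python
from nltk import Tree

# Hand-built stand-in for the result of nltk.ne_chunk(); leaves are (word, tag) pairs
tree = Tree("S", [
    Tree("PERSON", [("Barack", "NNP")]),
    Tree("PERSON", [("Obama", "NNP")]),
    ("was", "VBD"), ("born", "VBN"), ("in", "IN"),
    Tree("GPE", [("Hawaii", "NNP")]),
])

# Every subtree other than the root "S" is a recognized entity
entities = []
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    text = " ".join(word for word, tag in subtree.leaves())
    entities.append((subtree.label(), text))

print(entities)  # [('PERSON', 'Barack'), ('PERSON', 'Obama'), ('GPE', 'Hawaii')]
```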

  6. Sentiment analysis: NLTK can be used to classify the sentiment of text, such as whether a movie review is positive or negative. This can automate sentiment labelling that would otherwise be done by hand and enable large-scale analysis of user feedback or customer reviews. For example:

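   import nltk

   def word_features(text):
       # Bag-of-words features: every lowercased token maps to True
       return {word.lower(): True for word in nltk.word_tokenize(text)}

   # Tiny hand-labelled training set, for illustration only
   train = [
       ("I really enjoyed this wonderful film", "pos"),
       ("What a great, enjoyable story", "pos"),
       ("I hated this boring film", "neg"),
       ("What a terrible, dull story", "neg"),
   ]

   # Train a Naive Bayes classifier on (features, label) pairs
   classifier = nltk.NaiveBayesClassifier.train(
       [(word_features(text), label) for text, label in train]
   )

   sentence = "I really enjoyed watching that movie!"
   print(classifier.classify(word_features(sentence)))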

Output:

   pos
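Before trusting a classifier like this on real reviews, it helps to check it on text it has not seen. The sketch below is a minimal illustration with made-up training and test sentences; it uses plain str.split for tokenizing so that it runs without downloaded data, and a real evaluation would need far more examples.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

def word_features(text):
    # Bag-of-words features: each lowercased word maps to True
    return {word: True for word in text.lower().split()}

# Hypothetical hand-labelled data, for illustration only
train = [
    ("i loved this wonderful film", "pos"),
    ("what a great enjoyable story", "pos"),
    ("i really enjoyed the acting", "pos"),
    ("i hated this boring film", "neg"),
    ("what a terrible dull story", "neg"),
    ("the acting was awful", "neg"),
]
test = [
    ("a wonderful story", "pos"),
    ("boring and awful", "neg"),
]

classifier = NaiveBayesClassifier.train(
    [(word_features(text), label) for text, label in train]
)

# Fraction of held-out examples whose predicted label matches the gold label
test_set = [(word_features(text), label) for text, label in test]
score = accuracy(classifier, test_set)
print(score)  # 1.0
```

A perfect score on two sentences says very little; the point is the pattern of keeping training and test data separate.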

  7. Text normalization: NLTK can help normalize text by converting it to lowercase, removing punctuation, and handling contractions, among others. This can help reduce variations in spelling and formatting and improve the consistency of text data. For example:

   import nltk

   sentence = "Don't forget to bring your sister's book!"

   # Lowercase, then tokenize: "don't" becomes "do" + "n't", "sister's" becomes "sister" + "'s"
   words = nltk.word_tokenize(sentence.lower())

   # Keep purely alphabetic tokens, which drops punctuation and clitics such as "n't"
   words = [w for w in words if w.isalpha()]

   # Reduce each remaining word to its stem
   stemmer = nltk.PorterStemmer()
   words = [stemmer.stem(w) for w in words]
   print(words)

Output:

   ['do', 'forget', 'to', 'bring', 'your', 'sister', 'book']
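Contraction handling usually needs a lookup table of its own. The sketch below assumes a small hand-written mapping (hypothetical, not part of NLTK) and also converts typographic apostrophes to straight ones first, since text copied from the web often contains them.

```python
import re
from nltk.tokenize import RegexpTokenizer

# Hypothetical, hand-written mapping; extend it for real data
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not"}

def normalize(text):
    # Curly apostrophe -> straight apostrophe, then lowercase
    text = text.replace("\u2019", "'").lower()
    # Expand the contractions we know about
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Drop possessive 's
    text = re.sub(r"'s\b", "", text)
    # Keep runs of letters only, which removes punctuation
    return RegexpTokenizer(r"[a-z]+").tokenize(text)

normalized = normalize("Don\u2019t forget to bring your sister\u2019s book!")
print(normalized)
# ['do', 'not', 'forget', 'to', 'bring', 'your', 'sister', 'book']
```

Unlike the alphabetic filter, this keeps the negation ("not"), which often matters for tasks such as sentiment analysis.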

  8. Word frequency analysis: NLTK can count the frequency of each word in a corpus and plot the counts; together with the third-party wordcloud package, the same counts can drive a word cloud visualization. This can help identify key themes or topics in a document and aid in exploratory data analysis. For example:

   import nltk
   import matplotlib.pyplot as plt
   from wordcloud import WordCloud  # third-party package: pip install wordcloud

   text = "The quick brown fox jumps over the lazy dog. The lazy dog barks at the quick brown fox."

   # Tokenize the text into words
   words = nltk.word_tokenize(text)

   # Count the frequency of each token (case-sensitive, punctuation included)
   freq_dist = nltk.FreqDist(words)

   # Print and plot the 10 most frequent tokens
   print(freq_dist.most_common(10))
   freq_dist.plot(10)
   plt.show()

   # Draw a word cloud sized by token frequency
   wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freq_dist)
   plt.imshow(wc, interpolation="bilinear")
   plt.axis("off")
   plt.show()

Output:

   [('The', 2), ('quick', 2), ('brown', 2), ('fox', 2), ('the', 2), ('lazy', 2), ('dog', 2), ('.', 2), ('jumps', 1), ('over', 1)]
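Because the counts above are case-sensitive and include punctuation, "The" and "the" are tallied separately. A minimal sketch of one way around that, assuming a simple word-character pattern is an acceptable tokenizer for the text at hand:

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

text = "The quick brown fox jumps over the lazy dog. The lazy dog barks at the quick brown fox."

# Lowercase first, then keep only runs of word characters (no punctuation)
tokens = RegexpTokenizer(r"\w+").tokenize(text.lower())
freq_dist = FreqDist(tokens)

print(freq_dist.most_common(3))  # [('the', 4), ('quick', 2), ('brown', 2)]
print(freq_dist.N())             # 18 tokens in total
print(freq_dist.hapaxes())       # words seen once: ['jumps', 'over', 'barks', 'at']
```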

  9. Language translation: NLTK does not translate text by itself, but its nltk.translate package provides building blocks for statistical machine translation, such as IBM word-alignment models, and metrics such as BLEU for scoring a candidate translation against a reference. This can be useful when you evaluate translations of text that comes from multiple sources. For example:

   from nltk.translate.bleu_score import sentence_bleu

   # Reference translation, plus a candidate produced by some external system
   reference = "El rápido zorro marrón salta sobre el perro perezoso .".split()
   candidate = "El zorro marrón rápido salta sobre el perro perezoso .".split()

   # Higher BLEU means closer to the reference; 1.0 is an exact match
   score = sentence_bleu([reference], candidate)
   print(round(score, 3))

Output:

   0.615

  10. Document similarity analysis: Text can be compared using various measures, such as cosine similarity, Jaccard similarity, or Euclidean distance. NLTK offers some of these measures itself and also combines well with scikit-learn, which the example below uses for TF-IDF vectors and cosine similarity. This can help identify duplicate content, plagiarism, or related documents in a corpus. For example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = ["The quick brown fox jumps over the lazy dog.",
                 "A fast brown dog runs across the road."]

    # Compute TF-IDF vectors (the default settings lowercase the text)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Compute pairwise cosine similarities
    similarities = cosine_similarity(tfidf_matrix)
    print(similarities.round(3))

Output:

    [[1.    0.304]
     [0.304 1.   ]]

The output shows that the two documents have a cosine similarity of about 0.3: they share a few words ("the", "brown", "dog") but are otherwise not very similar.
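For a set-based measure that stays within NLTK, Jaccard similarity can be sketched with nltk.metrics. The snippet below reuses the two example documents and assumes that lowercased word-character tokens are a reasonable unit of comparison.

```python
from nltk.metrics.distance import jaccard_distance
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+")

# Represent each document as a set of lowercased word tokens
doc_a = set(tokenizer.tokenize("The quick brown fox jumps over the lazy dog.".lower()))
doc_b = set(tokenizer.tokenize("A fast brown dog runs across the road.".lower()))

# Jaccard distance = 1 - |intersection| / |union|
distance = jaccard_distance(doc_a, doc_b)
similarity = 1 - distance

print(sorted(doc_a & doc_b))  # ['brown', 'dog', 'the']
print(round(similarity, 3))   # 0.231 (3 shared words out of 13 distinct ones)
```

Unlike TF-IDF with cosine similarity, this ignores how often a word occurs and how rare it is, so it is best suited to quick overlap checks.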
