NLP: Text analysis using Spacy with Python

Agenda

  • Intro to NLP and Spacy

  • Pre-processing

  • Sentence segmentation

  • Tokenisation

  • POS tagging

  • Named Entity Recognition

  • Stop word removal

  • Removing punctuation and stripping

  • Lemmatisation

  • Dependency visualisation

  • Frequency analysis

  • Multiprocessing pipelines

Natural Language Processing (NLP)

  • a subfield of Artificial Intelligence (AI)

  • aims to enable computers to understand human language (text and spoken content)

  • forms part of computational linguistics and makes use of

    • rule-based matching combined with statistical analysis
    • machine learning and deep learning models

    to comprehend human language
  • With NLP, natural language can be

    • analysed
    • quantified
    • understood
    • interpreted to derive meaning (context)
  • Recent research by OpenAI led to the creation of ChatGPT (one of the most advanced natural language models)

Application of NLP

  • text analysis and classification
  • sentiment analysis
  • automatic summarisation
  • chatbots
  • speech recognition
  • translation
  • predictive text

Spacy

  • NLP library
  • Support for over 72 languages
  • offers pre-trained models
  • allows named-entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatisation, morphological analysis, entity linking and more
  • has built-in visualiser
  • support for custom pipelines (tokenisation, tagging, parsing, ner, …), as sketched below
  • efficient library (memory usage)
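
A minimal sketch of a custom pipeline component, assuming Spacy v3 (where components are registered with the @Language.component decorator); the component name token_counter is made up for this example:

#Adding a custom component to the pipeline (illustrative sketch)
import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    print("Tokens in doc: {0}".format(len(doc)))
    return doc #a component receives and returns the Doc object

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("token_counter", last=True) #appended after the built-in components
doc = nlp("Spacy pipelines can be extended with custom components.")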

Installation of tools

Tools

  • VS Code
  • VS Code extensions (Jupyter Notebooks: top 4 extensions)

Libraries

  • Pandas

    Allows data manipulation and analysis

    pip install pandas
  • Spacy

    NLP library that enables text analysis

    pip install -U spacy
    python3 -m spacy download en_core_web_sm
  • WordCloud

    Generates word cloud images

    pip install wordcloud
  • Matplotlib

    A practical visualisation library

    pip install matplotlib

How can we analyse text with Spacy?

  • Split sentences
  • Tokenise text
  • POS Tagging
  • Remove stop words
  • Lemmatise words
  • Analyse word frequency
  • Visualise dependencies
  • Named-entity recognition

Spacy language processing pipeline

  • Default pipeline (see the sketch below):

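The components of a loaded model's default pipeline can be listed directly; a minimal sketch (the exact component names depend on the model and Spacy version):

#Listing the components of the default pipeline
import spacy
nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names) #e.g. tagger, parser, ner and others, depending on the model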

Getting started with Spacy

#importing spacy and loading the model
import spacy
nlp = spacy.load('en_core_web_sm')

Reading from csv file

#using pandas to read csv
import pandas as pd
df = pd.read_csv('text.csv')

print(df.head())

CSV metadata
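
The CSV's metadata can also be inspected with pandas; a minimal sketch, assuming the df loaded above from text.csv:

#Inspecting the CSV's metadata (columns, dtypes, non-null counts)
print(df.columns.tolist()) #the examples below access a column named 'para'
df.info()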

Sentence segmentation

  • Sometimes we have big chunks of text consisting of multiple sentences or paragraphs
  • For text analysis, we need to split such text into separate sentences
  • Often considered a preprocessing step
#Splitting paragraph into sentences
paragraph=df.para[0]

processed_para=nlp(paragraph) #processed_para is a spacy doc object
# print(processed_para.sents)

sentences=[]
for i,sent in enumerate(processed_para.sents):
    print("{0} : {1}".format(i,sent))
    sentences.append(sent)

sentence=str(sentences[0]) #converting spacy span object to string
print('\n'+sentence)

Segmented sentences

Tokenisation

  • First step in most NLP pipelines
  • Splits text into discrete elements (words and punctuation)
  • A token can be a word (e.g. a verb or noun), a punctuation symbol, etc.
#Tokenisation
doc = nlp(sentence)

for token in doc:
    print("{0}\t{1}".format(token.text,token.idx)) 
    #idx property gives index of word in sentence

Some tokens from the sentence

How are tokens encoded?

  • The vocabulary is encoded using hashes in Spacy
  • Strings are encoded as hashes before being stored in a hashmap
#How are tokens encoded in spacy?
print(doc.vocab.strings) #the vocabulary uses a hashmap (StringStore) to store strings
print(doc.vocab.strings['creative']) #convert string to hash
print(nlp.vocab.strings[1433653077910583464]) #convert hash back to string

Output from the code

Part-of-speech tagging

  • the process of categorising words based on

    • the definition of the word
    • its context
  • Example:

    POS tags at the bottom of the words

  • POS tags are encoded into a shorter form by Spacy

  • used to understand the context and meaning of the word in a sentence

#Token attributes and Part-of-speech tagging
for token in doc:
    print("{0} \tTag: {1} \tPOS:{2} \nDescription: {3}\n".format(token,token.tag_,token.pos_,spacy.explain(token.tag_)))

Example of POS tagging

Types of POS tags

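The meaning of any POS tag can be looked up with spacy.explain; a minimal sketch using a hand-picked subset of coarse-grained tags:

#Looking up descriptions of some common POS tags
import spacy
for pos in ['NOUN', 'VERB', 'ADJ', 'ADV', 'PRON', 'ADP', 'DET', 'PUNCT']:
    print("{0}\t{1}".format(pos, spacy.explain(pos)))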

Named Entity Recognition (NER)

  • A named entity is a “real-life object” such as a person, an organisation, a location, a time or a product
  • NER is the process of predicting and identifying named entities
#Named entity recognition
for entity in doc.ents:
    print("{0}    \t{1}\nDescription: {2}\n".format(entity.text,entity.label_,spacy.explain(entity.label_)))


Visualising Named Entities

from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

Visualisation

Stop word removal

  • Stop word removal removes stop words so that they do not skew other processes such as frequency analysis
  • Stop words are the most common words in a language and usually carry little meaning on their own
  • Examples: a, the, are, but
#Stop word removal
for token in doc: #displaying stop words
    if token.is_stop:
        print(token)

new_doc=[token for token in doc if not token.is_stop] #removing stop words
print(new_doc)

Result

Removing punctuation and stripping

  • Stripping the tokens removes surrounding whitespace
  • Removing punctuation will remove punctuation symbols such as (. , -)
#Removing punctuation
new_doc=[str(token).strip() for token in doc if not token.is_punct] #filtering punctuation and stripping words
print(new_doc)

Result

Lemmatisation

  • process of reducing a word to its base word (root)
  • the base word(root) is called the lemma
  • for example: for eats, ate and eating, the lemma is eat
  • Helps in normalising text
new_doc=[token.lemma_ for token in doc] #lemmatisation
print(new_doc)


Dependencies

  • the process of extracting the dependency graph of a sentence
  • used to represent its grammatical structure
  • defines the dependency relationships between head words and their dependents
  • The graph can be read as follows:
    • words/tokens are nodes
    • dependencies are edges (relationships/arrows)

Analysing Dependencies

for token in doc:
    print("{0} \tHead: {1} \tDependency: {2} \tHead POS:{3}".format(token,token.head.text,token.dep_,token.head.pos_))


Dependency Visualisation

displacy.render(doc, style='dep', jupyter = True)

Part of the dependency visualisation

Frequency analysis

  • The number of times each word appears can be counted
  • Python's Counter class (from the collections module) can be used for this purpose


#Frequency analysis
from collections import Counter
words=[word.text for word in nlp(df.para[3])]
common=Counter(words).most_common(5) #returns top 5 most common words
print(common)
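
The raw counts above include stop words and punctuation; a minimal sketch that filters them out first, assuming the same df and nlp from the earlier cells:

#Frequency analysis after removing stop words, punctuation and whitespace
filtered=[token.text.lower() for token in nlp(df.para[3])
          if not token.is_stop and not token.is_punct and not token.is_space]
print(Counter(filtered).most_common(5))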

Generating a WordCloud

  • data visualisation technique
  • represents text frequency and importance
  • bigger words have higher frequency


#Generating a wordcloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud

joined_text = " ".join(token for token in words)
# print(joined_text)
wordcloud = WordCloud().generate(joined_text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('wordcloud.png')
plt.show()

Multiprocessing pipelines

  • Spacy allows multiprocessing
  • More than one process can be spawned at a time
  • It is inefficient for small datasets but far more useful on large datasets processed in large batches
#Multiprocessing with spacy
texts=df.para.tolist() #converting dataframe to list
docs = nlp.pipe(texts, n_process=4) #using 4 processes to process texts

for doc in docs:
    print(doc)
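
Throughput can often be improved further by tuning the batch size and disabling pipeline components that are not needed; a minimal sketch, assuming the texts list and nlp from above (the batch size and disabled components are illustrative assumptions):

#Multiprocessing with a tuned batch size and unused components disabled
docs = nlp.pipe(texts, n_process=4, batch_size=100, disable=["parser", "ner"])

for d in docs:
    print(len(d)) #number of tokens in each processed text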

Concept analysis

  • uses a pre-trained resource (SenticNet)
  • concept-based
  • enables sentiment prediction
#Bonus
#Sentiment analysis library (SenticNet)
from senticnet.senticnet import SenticNet
sn = SenticNet()

for token in doc:
    try:
        if token.pos_ == 'ADJ':
            print("\n{0}".format(token.text))
            print(sn.concept(token.text)) #look up the concept's sentiment information
    except Exception:
        pass #token not present in SenticNet


Future Sessions

Learn the coding skills to advance your career!