e-lo, Natural Language Processing (NLP)

TextBlob 101 Information Retrieval

 

  • Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources.
  • Searches can be based on full-text or other content-based indexing.
  • Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.
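To make full-text indexing a little more concrete, here is a minimal sketch of an inverted index in plain Python. It is not how TextBlob or any particular search engine works; the function names (`tokenize`, `build_index`, `search`) and the sample documents are assumptions made up for this illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, invented for this illustration.
documents = {
    "doc1": "Nature is the art whereby God governs the world",
    "doc2": "The art of man can make an artificial animal",
    "doc3": "Reading a book about the world",
}

index = build_index(documents)
print(sorted(search(index, "art world")))
```

The index is built once, and each query then only intersects a few small sets instead of rescanning every document.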

I tried to run the following code, but it turned out I was a bit too quick: I had forgotten to download the corpora.

from textblob import TextBlob
blob = TextBlob(MyCommand.elo.data_txt(book))
print(blob.sentences)

MyCommand.elo.data_txt(book) returns the actual text saved in the db. The text saved here is the introduction chapter of Leviathan, by Thomas Hobbes (1651).
But the command output was:

 

  • For the uninitiated – practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora.
  • In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).
  • In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
  • A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

 

I downloaded the NLTK corpora into the virtual environment:
(e-locmd) C:\Users\espen\virtualEnvs\e-locmd\app>python -m textblob.download_corpora
Ready to move on.

 

Tokenization
Tokenization refers to dividing text or a sentence into a sequence of tokens, which roughly correspond to “words”. This is one of the basic tasks of NLP.
blob = TextBlob(MyCommand.elo.data_txt(book))
# print(blob.sentences)
for sentence in blob.sentences:
    print(str(sentence) + "\n")

 

Each sentence is printed followed by an extra newline, so the sentences are clearly separated.

 

 

The actual value of blob.sentences is a list in this format:
[Sentence("sentence0"), Sentence("sentence1")]
Now that we have all the sentences, the next step is to get the words.
Let's take sentence 0 and get its words:

 

 

Here you see part of the result of this code:
for word in blob.sentences[0].words:
    print(word)
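To show what splitting text into sentences and words means mechanically, here is a naive sketch using only regular expressions. It is a stand-in for illustration, not TextBlob's tokenizer: the library's own tokenizers are considerably more robust (this version breaks on abbreviations such as "Dr.", for instance). The helper names and the sample text are assumptions made up for this example.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Return runs of letters, digits and apostrophes; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


# Hypothetical sample text, invented for this illustration.
text = "Tokenization is a basic task. It splits text into tokens! Is that all?"

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence + "\n")

for word in split_words(sentences[0]):
    print(word)
```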

 

Noun Phrase Extraction
Instead of extracting individual words as in the previous section, we can extract the noun phrases directly from the TextBlob.
Noun phrase extraction is particularly important when you want to analyze the “who” in a sentence.
(Nouns: common nouns are words you can put the Norwegian articles en, ei or et in front of.)
Let's extract the noun phrases from the introduction:
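noun = []
# collect every noun phrase TextBlob finds in the text
for phrase in blob.noun_phrases:
    noun.append(phrase)
# bundle the name, the blob and its noun phrases
tu = (name, blob, noun)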

 

 

Here you can see a (not perfect) list of noun phrases; remember, we are working with machines. It needs more work, but it gives a picture of the nouns in the text.
There are some duplicates here, which we can remove with a set. Sets are unordered collections of distinct objects.
tmp = MyCommand.elo.data_txt(book)
name = tmp[0]
note = tmp[1]
noun = tmp[2]
print(len(noun))
s = set(noun)
print(len(s))

 

 

Here you can see that there are 123 noun phrases in total and 104 after converting the list to a set; the word "god", for example, now appears only once.
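Because a set discards both order and counts, it can help to look at the alternatives side by side. The sketch below uses a made-up list of phrases (not the real output from the book) to compare a plain set, an order-preserving de-duplication with dict.fromkeys, and a Counter that keeps track of how often each phrase occurred.

```python
from collections import Counter

# Hypothetical noun phrases, invented for this illustration.
noun = ["god", "nature", "art", "god", "man", "art", "god"]

unique_unordered = set(noun)                # distinct items, arbitrary order
unique_ordered = list(dict.fromkeys(noun))  # distinct items, first-seen order
counts = Counter(noun)                      # how often each phrase occurred

print(len(noun), len(unique_unordered))
print(unique_ordered)
print(counts.most_common(2))
```

The Counter is the one to reach for if the number of occurrences matters later, for example to see which phrases dominate the chapter.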