
TextBlob 101 Information Retrieval

Posted on August 1, 2018 (updated August 8, 2018) by espenk
TextBlob: https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/

 

  • Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources.
  • Searches can be based on full-text or other content-based indexing.
  • Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.
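Before any of this, TextBlob itself needs to be installed into the virtual environment; a minimal setup sketch, using the same prompt as the corpora download further down:

(e-locmd) C:\Users\espen\virtualEnvs\e-locmd\app>pip install textblob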

I tried to run the following code, but it turned out I was getting a bit ahead of myself: I had forgotten to download the corpora.

from textblob import TextBlob
blob = TextBlob(MyCommand.elo.data_txt(book))
print(blob.sentences)

MyCommand.elo.data_txt(book) returns the actual text saved in the database; the text stored here is the introduction chapter of Leviathan, by Thomas Hobbes (1651).
But the command failed, because the corpora were not yet available:

 

  • For the uninitiated – practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora.
  • In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).
  • In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
  • A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

 

I downloaded the NLTK corpora to the environment:
(e-locmd) C:\Users\espen\virtualEnvs\e-locmd\app>python -m textblob.download_corpora
Ready to move on.
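As a quick sanity check (a minimal sketch on a hard-coded string, not the text from the database), sentence splitting should now work:

from textblob import TextBlob

check = TextBlob("TextBlob wraps NLTK. It needs the downloaded corpora for sentence splitting.")
print(check.sentences)  # two Sentence objects once the corpora are in place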

 

Tokenization
Tokenization refers to dividing text or a sentence into a sequence of tokens, which roughly correspond to “words”. This is one of the basic tasks of NLP.
blob = TextBlob(MyCommand.elo.data_txt(book))
# print(blob.sentences)
for x in range(len(blob.sentences)):
    print(str(blob.sentences[x]) + "\n")

 

The output here is each sentence followed by a newline, so they are all clearly separated.

The actual output of blob.sentences is a list in the format:
[Sentence(“sentence0”), Sentence(“sentence1”)]
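Each element is a Sentence object rather than a plain string; if plain strings are needed, they can be converted explicitly (a small sketch):

plain = [str(s) for s in blob.sentences]  # str(s) or s.raw gives the plain text
print(plain[0])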
So now we have all the sentences; the next step is to get the words.
Let's take sentence 0 and get the words:

Here you see part of the output of this code:
for word in blob.sentences[0].words:
    print(word)
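Instead of going sentence by sentence, the whole text can also be tokenized into words at once: blob.words returns a WordList and blob.word_counts holds lower-cased word frequencies. A small sketch using the same blob as above (the word “reason” is just an example lookup):

print(len(blob.words))              # all word tokens in the introduction
print(blob.word_counts["reason"])   # occurrences of a given lower-cased word (0 if absent)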

 

Noun Phrase Extraction
Since we extracted the individual words in the previous section, we can instead extract just the noun phrases from the TextBlob.
Noun Phrase extraction is particularly important when you want to analyze the “who” in a sentence.
(Noun: in Norwegian, a common noun is a word you can put the articles en, ei or et in front of.)
Let's extract the noun phrases from the introduction:
noun = []
for np in blob.noun_phrases:
    noun.append(np)
tu = (name, blob, noun)  # name is defined elsewhere in the e-lo app
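As an aside, blob.noun_phrases is already a list-like WordList, so the loop above is equivalent to a plain conversion:

noun = list(blob.noun_phrases)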

 

 

Here you can see a list (not perfect) of noun phrases; remember, we are working with machines. The result still needs some cleanup, but it gives a picture of the nouns in the text.
There are some duplicates here; we can remove those with a set. Sets are unordered collections of distinct objects.
tmp = MyCommand.elo.data_txt(book)  # the stored record: name, note and noun phrases
name = tmp[0]
note = tmp[1]
noun = tmp[2]
print(len(noun))  # all noun phrases, duplicates included
s = set(noun)
print(len(s))     # distinct noun phrases

 

 

Here you can see that there are 123 noun phrases in total and 104 once converted to a set, i.e. a word like “god” now appears only once.
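If you also want to see which noun phrases repeat, and how often, collections.Counter is a small step beyond the set (a sketch, assuming the noun list built above):

from collections import Counter

counts = Counter(noun)
print(counts.most_common(5))  # the five most frequent noun phrases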
