e-lo [IT Engineer life]

TextBlob 101 Information Retrieval

Posted on August 1, 2018 (updated August 8, 2018) by espenk
TextBlob https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/

 

  • Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources.
  • Searches can be based on full-text or other content-based indexing.
  • Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.

I tried to run the following code, but I was getting ahead of myself: I had forgotten to download the corpora.

from textblob import TextBlob
blob = TextBlob(MyCommand.elo.data_txt(book))
print(blob.sentences)

MyCommand.elo.data_txt(book) returns the text stored in the database. Here that text is the introduction to Leviathan by Thomas Hobbes (1651).
But the command output was:

 

  • For the uninitiated – practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora.
  • In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).
  • In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
  • A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

 

I downloaded the NLTK corpora into the virtual environment:
(e-locmd) C:\Users\espen\virtualEnvs\e-locmd\app>python -m textblob.download_corpora
Now we are ready to move on.

 

Tokenization
Tokenization refers to dividing text or a sentence into a sequence of tokens, which roughly correspond to “words”. This is one of the basic tasks of NLP.
blob = TextBlob(MyCommand.elo.data_txt(book))
# print(blob.sentences)
# Print each sentence followed by an extra newline
for sentence in blob.sentences:
    print(str(sentence) + "\n")

 

The output is each sentence followed by a blank line, which keeps the sentences visually separate.
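
To make the idea of sentence tokenization concrete, here is a rough sketch that uses only the standard library's re module. It is a simplified stand-in and not how TextBlob tokenizes internally; the function name split_sentences and the sample text are made up for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Made-up sample text, not a quote from the book
sample = "Nature is the art of God. Art imitates nature! Does it go further?"

sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence + "\n")
```

A rule this simple breaks on abbreviations such as "e.g." or "Mr.", which is one reason to rely on a trained tokenizer and its corpora instead.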

 

 

The actual value of blob.sentences is a list in the format:
[Sentence("sentence0"), Sentence("sentence1")]
Now that we have all the sentences, the next step is to get the words.
Let's take sentence 0 and get its words:

 

 

Here is part of the result of this code:
for word in blob.sentences[0].words:
    print(word)
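
As a rough illustration of what splitting a sentence into words involves, the sketch below uses a plain regular expression. It is an assumption-laden simplification, not TextBlob's word tokenizer; split_words and the sample sentence are hypothetical.

```python
import re


def split_words(sentence):
    """Naive word tokenizer: keep runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


# Made-up sample sentence, not a quote from the book
sentence = "Nature is the art of God."

for word in split_words(sentence):
    print(word)
```

Punctuation is dropped here because it never matches the pattern, which is similar in spirit to the punctuation-free word list shown above.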

 

Noun Phrase Extraction
In the previous section we extracted individual words. Instead, we can extract the noun phrases directly from the TextBlob.
Noun phrase extraction is particularly important when you want to analyze the “who” in a sentence.
(Noun: in Norwegian, common nouns are the words you can put the articles en, ei or et in front of.)
Let's extract the noun phrases from the introduction:
noun = []
for phrase in blob.noun_phrases:
    noun.append(phrase)
# Bundle the name, the blob and the noun phrases into one tuple
tu = (name, blob, noun)

 

 

Here you can see a list (not perfect) of noun phrases. Remember that we are working with machines, so the result needs further work, but it gives a picture of the nouns in the text.
There are some duplicates, which we can remove with a set. Sets are unordered collections of distinct objects.
tmp = MyCommand.elo.data_txt(book)
name = tmp[0]
note = tmp[1]
noun = tmp[2]
print(len(noun))
s = set(noun)
print(len(s))

 

 

Here you can see that the list holds 123 noun phrases while the set holds 104, i.e. the word “god” now appears only once.
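
Because a set does not keep the original order, it can help to see the alternatives side by side. The following is a minimal sketch with a short, made-up list standing in for the real noun phrases; it also uses collections.Counter to show which entries were duplicated.

```python
from collections import Counter

# Hypothetical noun phrases; the real list comes from blob.noun_phrases
noun = ["god", "nature", "art", "god", "man", "nature", "god"]

# Distinct items, order not guaranteed
unique_unordered = set(noun)

# Distinct items in first-seen order (dicts preserve insertion order)
unique_ordered = list(dict.fromkeys(noun))

# Which phrases occurred more than once, and how often
duplicates = {phrase: count for phrase, count in Counter(noun).items() if count > 1}

print(len(noun), len(unique_unordered))
print(unique_ordered)
print(duplicates)
```

If the order of the phrases in the text matters later on, dict.fromkeys is probably the safer choice; if only membership matters, the set is enough.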

Photo by Markus Spiske from Pexels "Matrix"

©2022 e-lo [IT Engineer life] | Powered by WordPress & Superb Themes