Top NLP (Natural Language Processing) Projects Using Python (Includes Links to Repositories on GitHub)

Github

Official Documentation

bert-as-service is a sentence encoding service for Python users that maps a variable-length sentence to a fixed-length vector.

BERT is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of plain text data publicly available on the web and is trained in an unsupervised manner. Pre-training a BERT model is a fairly expensive yet one-time procedure for each language. Fortunately, Google has released several pre-trained models that you can download.

Sentence encoding (or embedding) is an upstream task required in many NLP applications, e.g. sentiment analysis and text classification. The goal is to represent a variable-length sentence as a fixed-length vector, e.g. hello world as [0.1, 0.3, 0.9]. Each element of the vector should “encode” some semantics of the original sentence.
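To make the idea of a fixed-length sentence vector concrete, here is a minimal sketch using only the Python standard library. It is not BERT and not how bert-as-service works: it simply hashes each word into one of a fixed number of buckets, so the resulting numbers carry no real semantics. The function name encode_sentence and the dim parameter are made up for this illustration.

```python
import hashlib
import math
from typing import List


def encode_sentence(sentence: str, dim: int = 8) -> List[float]:
    """Toy encoder: hash each token into one of `dim` buckets, then L2-normalize."""
    vector = [0.0] * dim
    for token in sentence.lower().split():
        # A stable hash (unlike the built-in hash()) keeps results reproducible.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    # An empty sentence stays an all-zero vector to avoid dividing by zero.
    return [v / norm for v in vector] if norm else vector


short = encode_sentence("hello world")
longer = encode_sentence("a much longer sentence still maps to the same number of dimensions")
print(len(short), len(longer))  # prints: 8 8
```

Whatever the sentence length, the output always has dim entries, which is the property downstream models rely on. A real encoder such as BERT learns the mapping so that similar sentences end up with similar vectors.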

Finally, bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

What makes it special?

state of the art
easy to use
fast
scalable
reliable

Github

Official Documentation

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks.

TextBlob is a simple, Pythonic text processing library known for:

sentiment analysis
part-of-speech tagging
noun phrase extraction
translation
and more

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

Features it offers:

Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration
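Three of the features above (tokenization, word frequencies, and n-grams) are simple enough to sketch with the standard library alone. The code below is only meant to show what these terms mean. It is not TextBlob's implementation and does not use its API, and the helper names tokenize and ngrams are chosen for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(text)
frequencies = Counter(tokens)   # word -> number of occurrences
bigrams = ngrams(tokens, 2)     # n-grams with n = 2

print(frequencies.most_common(2))  # prints: [('the', 3), ('dog', 2)]
print(bigrams[:2])                 # prints: [('the', 'quick'), ('quick', 'brown')]
```

A library like TextBlob wraps steps of this kind behind one object and adds the parts that are hard to hand-roll, such as part-of-speech tagging and sentiment scoring.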

Github

Ciphey is a library that automatically decrypts ciphertext without knowing the key or cipher, decodes encodings, and cracks hashes.

It is a fully automated decryption/decoding/cracking tool: you input encrypted text and get the decrypted text back. It relies on natural language processing and artificial intelligence, along with some common sense.

You may ask: what type of encryption?

That’s the point. You don’t know; you just know the text is possibly encrypted. Ciphey will figure it out for you, and it can solve most things in 3 seconds or less.

Ciphey aims to automate many decryptions and decodings, such as multiple base encodings, classical ciphers, hashes, and more advanced cryptography.

If you don’t know much about cryptography, or you want to quickly check the ciphertext before working on it yourself, Ciphey is for you.
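The general idea of "try several decoders and keep whichever output looks like language" can be sketched in a few lines. This is a toy illustration under strong assumptions, not Ciphey's actual algorithm: it only knows Base64, hex, and Caesar shifts, and its "language check" is a tiny hard-coded word list. All names here (COMMON_WORDS, english_score, guess_plaintext, and so on) are hypothetical.

```python
import base64
import string

# Hypothetical stand-in for a real language checker.
COMMON_WORDS = {"the", "and", "is", "to", "of", "a", "in", "hello", "world"}


def english_score(text):
    """Fraction of whitespace-separated words found in COMMON_WORDS."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS for w in words) / len(words)


def caesar_shift(text, shift):
    """Rotate ASCII letters by `shift` positions; leave other characters alone."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def candidates(ciphertext):
    """Return (method, plaintext) guesses from every decoder that succeeds."""
    guesses = []
    try:
        decoded = base64.b64decode(ciphertext, validate=True).decode("utf-8")
        guesses.append(("base64", decoded))
    except ValueError:  # also covers binascii.Error and UnicodeDecodeError
        pass
    try:
        guesses.append(("hex", bytes.fromhex(ciphertext).decode("utf-8")))
    except ValueError:
        pass
    for shift in range(1, 26):
        guesses.append((f"caesar-{shift}", caesar_shift(ciphertext, -shift)))
    return guesses


def guess_plaintext(ciphertext):
    """Pick the candidate whose text scores highest on the language check."""
    return max(candidates(ciphertext), key=lambda pair: english_score(pair[1]))


print(guess_plaintext("khoor zruog"))             # prints: ('caesar-3', 'hello world')
print(guess_plaintext("aGVsbG8gd29ybGQ="))        # prints: ('base64', 'hello world')
print(guess_plaintext("68656c6c6f20776f726c64"))  # prints: ('hex', 'hello world')
```

A real tool has to handle nested encodings, many more ciphers, and a far better plaintext detector, which is where the search strategy and language model matter.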

Why Ciphey?

50+ encryptions/encodings
Custom-built artificial intelligence with Augmented Search (AuSearch) for answering the question “what encryption was used?”
Custom-built natural language processing module
Multi-Language Support
Supports encryptions and hashes

Github

Official Documentation

Doccano is an open-source text annotation tool for machine learning practitioners.

It provides annotation features for text classification, sequence labelling, and sequence-to-sequence tasks, so you can create labelled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.

Features

Collaborative annotation
Multi-language support
Mobile support
Emoji support
Dark theme
RESTful API

Github

LazyNLP is an open-source library to scrape and clean web pages to create massive datasets.

A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one used by OpenAI for GPT-2.

The library requires Python 3 and builds the dataset by scraping webpages from their URLs.
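The clean-up and deduplication steps can be illustrated with the standard library. The sketch below is not LazyNLP's code or API: it skips the crawling step entirely (the pages are inline strings), strips markup with html.parser, and drops only exact duplicates by hashing the cleaned text, which is a simplification of what a real pipeline would need. The names TextExtractor, clean_html, and deduplicate are made up for this example.

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def deduplicate(pages):
    """Keep the first page for each distinct cleaned text (exact match only)."""
    seen = set()
    unique = []
    for html in pages:
        text = clean_html(html)
        fingerprint = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if text and fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique


pages = [
    "<html><body><h1>Hello</h1><p>NLP is fun.</p></body></html>",
    "<html><body><h1>Hello</h1>\n<p>NLP   is fun.</p><script>track()</script></body></html>",
    "<p>A different page.</p>",
]
corpus = deduplicate(pages)
print(corpus)  # prints: ['Hello NLP is fun.', 'A different page.']
```

The second page differs from the first only in whitespace and an embedded script, so it collapses to the same cleaned text and is dropped. Near-duplicates with small wording changes would need a fuzzier comparison than a hash of the whole text.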

Github

Official Documentation

Textract is an open-source library to extract text from any document without any muss or fuss. This package provides a single interface for extracting content from any type of file, without any irrelevant markup.
