Group by in Pandas, SQL, and NoSQL

MongoDB (NoSQL database)

NoSQL refers to non-SQL or non-relational database design. A NoSQL database still stores data in an organized way, just not in tabular form.

There are several NoSQL databases used in the data science ecosystem. In this article, we will be using MongoDB, which stores data as documents. A document in MongoDB consists of field-value pairs. Documents are organized in a structure called a “collection”. As an analogy, we can think of documents as rows in a table and collections as tables.

The dataset is stored in a collection called marketing. Here is a document in the marketing collection that represents an observation (i.e. a row in a table).

> db.marketing.find().limit(1).pretty()
{
"_id" : ObjectId("6014dc988c628fa57a508088"),
"Age" : "Middle",
"Gender" : "Male",
"OwnHome" : "Rent",
"Married" : "Single",
"Location" : "Close",
"Salary" : 63600,
"Children" : 0,
"History" : "High",
"Catalogs" : 6,
"AmountSpent" : 1318
}

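The db refers to the current database, and the collection name comes after the dot. Here, find() retrieves documents, limit(1) restricts the result to a single document, and pretty() formats the output for readability.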

MongoDB provides the aggregation pipeline for data analysis operations such as filtering and transforming. A pipeline is a list of stages that are applied in order. For group by operations, we use the “$group” stage in the aggregation pipeline.
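To make the idea of a pipeline concrete, here is a minimal sketch in plain Python. It does not use MongoDB at all: the stage functions (keep_males, keep_age_and_amount) and the run_pipeline helper are hypothetical names invented for illustration, and the sample documents are made up, borrowing only the field names from the marketing collection. It only shows how each stage receives the output of the previous one.

```python
from functools import reduce

# Made-up sample documents; only the field names come from the article.
docs = [
    {"Age": "Middle", "Gender": "Male", "AmountSpent": 1318},
    {"Age": "Young", "Gender": "Female", "AmountSpent": 500},
    {"Age": "Old", "Gender": "Male", "AmountSpent": 1200},
]


def keep_males(documents):
    """Filtering stage (hypothetical): keep documents where Gender is "Male"."""
    return [d for d in documents if d["Gender"] == "Male"]


def keep_age_and_amount(documents):
    """Transforming stage (hypothetical): keep only Age and AmountSpent."""
    return [{"Age": d["Age"], "AmountSpent": d["AmountSpent"]} for d in documents]


def run_pipeline(documents, stages):
    """Apply each stage in order, feeding its output to the next stage."""
    return reduce(lambda current, stage: stage(current), stages, documents)


result = run_pipeline(docs, [keep_males, keep_age_and_amount])
print(result)
```

In MongoDB itself, the stages are written as documents inside the list passed to aggregate rather than as functions, but the order-dependent flow is the same.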

The first example calculates the average amount spent for each age group.

> db.marketing.aggregate([
... { $group: { _id: "$Age", avgSpent: { $avg: "$AmountSpent" }}}
... ])
{ "_id" : "Old", "avgSpent" : 1432.1268292682928 }
{ "_id" : "Middle", "avgSpent" : 1501.6909448818897 }
{ "_id" : "Young", "avgSpent" : 558.6236933797909 }

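The field (i.e. a column in a table) used for grouping is passed to the group stage with the “_id” keyword. Field names are prefixed with a dollar sign, as in “$Age”. We then assign a name of our choosing to each aggregation (avgSpent in this example), which contains the aggregation function ($avg) and the field to be aggregated (AmountSpent).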
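For readers more familiar with Python, the following is a rough sketch of what this group stage computes. It is a plain-Python stand-in, not MongoDB code: group_avg is a hypothetical helper, the sample documents are made up (only the field names match the marketing collection), and it assumes every document contains both fields with numeric amounts.

```python
from collections import defaultdict

# Made-up sample documents; only the field names come from the article.
docs = [
    {"Age": "Middle", "AmountSpent": 1318},
    {"Age": "Middle", "AmountSpent": 1000},
    {"Age": "Young", "AmountSpent": 500},
    {"Age": "Old", "AmountSpent": 1200},
    {"Age": "Young", "AmountSpent": 700},
]


def group_avg(documents, key_field, value_field, out_name):
    """Hypothetical helper: group by key_field and average value_field."""
    buckets = defaultdict(list)
    for doc in documents:
        # The grouping value becomes the "_id" of the output document.
        buckets[doc[key_field]].append(doc[value_field])
    return [
        {"_id": key, out_name: sum(values) / len(values)}
        for key, values in buckets.items()
    ]


result = group_avg(docs, "Age", "AmountSpent", "avgSpent")
for row in result:
    print(row)
```

Each output dictionary has the same shape as the documents returned by the aggregate call: the grouping value under “_id” and the computed average under the name we chose.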

