Data manipulation in Python is nearly synonymous with NumPy array manipulation: even Pandas is built around the NumPy array. Although some operations may seem a bit dry, they’re the building blocks of many other operations, so get to know them well.
2.1 NumPy array attributes
First, let’s discuss some useful array attributes, using three random arrays as examples: a one-, a two-, and a three-dimensional array. Let’s use NumPy’s random number generator and seed it with a set value to ensure that the same random arrays are generated each time we run the same code:
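A minimal sketch of that setup follows. The seed value 0, the integer range, and the shapes are illustrative assumptions, and the legacy np.random.seed interface is used here only because it is the shortest way to show seeding:

```python
import numpy as np

np.random.seed(0)  # fixed seed (assumed value) so the arrays are reproducible

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
```

Re-running the script with the same seed should produce the same three arrays.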
Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array):
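As a small illustration, the three attributes can be read directly off an array. The array below is an assumed example (seed 0, shape 3×4×5), not one prescribed by the text:

```python
import numpy as np

np.random.seed(0)
x3 = np.random.randint(10, size=(3, 4, 5))  # assumed three-dimensional example

print("x3 ndim: ", x3.ndim)   # 3
print("x3 shape:", x3.shape)  # (3, 4, 5)
print("x3 size: ", x3.size)   # 60, i.e. 3 * 4 * 5 elements
```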
2.2 Array indexing: Accessing single elements
If you’re familiar with Python’s standard list indexing, then this will be a piece of cake, and NumPy will feel very familiar to you. In a one-dimensional array, you can access the ith value (counting from zero) by specifying the index you want in square brackets, just as we do with Python lists:
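A short sketch of single-element access in one dimension, using an array with hand-picked values (an assumption made so the results are easy to check). Negative indices count from the end, as they do for Python lists:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])  # illustrative values

first = x1[0]    # 5, the first element
fifth = x1[4]    # 7, the fifth element
last = x1[-1]    # 9, the last element
second_last = x1[-2]  # 7, the second-to-last element

print(first, fifth, last, second_last)
```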
In a multidimensional array, you access and modify items using a comma-separated tuple of indices:
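The following sketch shows the comma-separated index form on a small two-dimensional array. The values are made up for illustration:

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])  # illustrative values

top_left = x2[0, 0]       # 3: row 0, column 0
bottom_left = x2[2, 0]    # 1: row 2, column 0
bottom_right = x2[2, -1]  # 7: row 2, last column

x2[0, 0] = 12             # the same notation modifies an item in place
print(x2[0])              # [12  5  2  4]
```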
Bear in mind that NumPy arrays have a fixed type (unlike Python lists). This means that if you try to insert a floating-point value into an integer array, the value will be silently truncated.
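A quick way to see this behavior, using an assumed integer array and an arbitrary floating-point value:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])  # integer dtype inferred from the values

x1[0] = 3.14159  # no warning or error is raised
print(x1[0])     # 3: the fractional part is dropped
print(x1.dtype)  # still an integer dtype
```

If the fractional part matters, one option is to create the array with a floating-point dtype in the first place, for example np.array([5, 0, 3], dtype=float).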
2.3 Array slicing: Accessing subarrays
Just as we can use square brackets to access individual elements, we can also use them to access subarrays with slice notation, marked by the colon (:) character. NumPy slicing syntax follows that of the standard Python list. To access a slice of an array x, use:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, and step=1. Let’s take a look at accessing subarrays in one dimension.
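The sketch below walks through the common one-dimensional cases on an assumed array of the integers 0 through 9. Note that when step is negative, the defaults for start and stop are swapped, which gives a convenient way to reverse an array:

```python
import numpy as np

x = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

first_five = x[:5]    # [0 1 2 3 4]
after_five = x[5:]    # [5 6 7 8 9]
middle = x[4:7]       # [4 5 6]
every_other = x[::2]  # [0 2 4 6 8]
odd_indexed = x[1::2] # [1 3 5 7 9]
reversed_x = x[::-1]  # [9 8 7 6 5 4 3 2 1 0]

print(first_five, middle, every_other)
```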
Now, let’s see it working in a multidimensional subarray:
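In more than one dimension, each axis gets its own slice, separated by commas. A minimal sketch on an assumed 3×4 array:

```python
import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])  # illustrative values

block = x2[:2, :3]      # first two rows, first three columns
every_other = x2[:, ::2]  # all rows, every other column
first_col = x2[:, 0]    # first column as a one-dimensional array
first_row = x2[0, :]    # first row; x2[0] is equivalent

print(block)
print(first_col)  # [12  7  1]
```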
2.4 Reshaping of arrays
Another useful type of operation is reshaping of arrays. The usual way of doing this is with the reshape() method. For instance, if you want to put the numbers 1 through 9 into a 3×3 grid, you can do the following:
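One way to write this, assuming np.arange is used to generate the numbers:

```python
import numpy as np

# np.arange(1, 10) yields the nine integers 1..9; reshape lays them out row by row
grid = np.arange(1, 10).reshape((3, 3))

print(grid)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
```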
Note: For reshape() to work, the size of the initial array must match the size of the reshaped array.
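To illustrate the size requirement, this sketch deliberately asks for a shape that does not fit. The 2×5 target is an arbitrary mismatching choice; NumPy raises a ValueError in this case:

```python
import numpy as np

values = np.arange(1, 10)  # 9 elements

try:
    values.reshape((2, 5))  # a 2x5 grid would need 10 elements
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```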
Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix. You can do this with the reshape method or by using np.newaxis within a slice operation:
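Both approaches are sketched below on an assumed three-element array. The reshape and np.newaxis forms should produce arrays of the same shape:

```python
import numpy as np

x = np.array([1, 2, 3])

row_reshape = x.reshape((1, 3))  # row vector via reshape, shape (1, 3)
row_newaxis = x[np.newaxis, :]   # row vector via newaxis, shape (1, 3)

col_reshape = x.reshape((3, 1))  # column vector via reshape, shape (3, 1)
col_newaxis = x[:, np.newaxis]   # column vector via newaxis, shape (3, 1)

print(row_newaxis.shape, col_newaxis.shape)  # (1, 3) (3, 1)
```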
2.5 Array concatenation and splitting
All of the previous routines worked on single arrays. But as a data scientist, you’ll often have to combine multiple arrays into one and split a single array into multiple arrays. Concatenation, or the joining of two or more arrays in NumPy, is primarily done using one of these routines: np.concatenate, np.vstack, or np.hstack.
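A short sketch of np.concatenate, which takes a list or tuple of arrays as its first argument. The arrays here are illustrative; for two-dimensional input, the axis parameter selects whether rows (axis=0, the default) or columns (axis=1) are joined:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
joined = np.concatenate([x, y])  # [1 2 3 3 2 1]

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
stacked_rows = np.concatenate([grid, grid])          # shape (4, 3)
stacked_cols = np.concatenate([grid, grid], axis=1)  # shape (2, 6)

print(joined)
print(stacked_rows.shape, stacked_cols.shape)  # (4, 3) (2, 6)
```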
For working with arrays of mixed dimensions, it can be clearer to use the np.vstack (vertical-stack) and np.hstack (horizontal-stack) functions:
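For example, a one-dimensional array can be stacked on top of a two-dimensional one, and a column can be attached to its side. The values below are assumptions chosen for readability:

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# Stack x on top of grid: the result has one more row
vertical = np.vstack([x, grid])

# Attach a column to the right of grid: the result has one more column
y = np.array([[99],
              [99]])
horizontal = np.hstack([grid, y])

print(vertical)
print(horizontal)
```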
2.6 Splitting of arrays
The opposite of concatenation is splitting, which can be achieved by using the functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points:
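A minimal sketch with np.split on an assumed eight-element array. The split points [3, 5] cut the array before index 3 and before index 5:

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Two split points produce three pieces: x[:3], x[3:5], and x[5:]
x1, x2, x3 = np.split(x, [3, 5])

print(x1, x2, x3)  # [1 2 3] [99 99] [3 2 1]
```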
Notice that N split points lead to N+1 subarrays. The related functions np.hsplit and np.vsplit work the same way:
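The sketch below applies both functions to an assumed 4×4 grid. np.vsplit cuts between rows and np.hsplit cuts between columns; a single split point gives two pieces in each case:

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

upper, lower = np.vsplit(grid, [2])  # rows 0-1 and rows 2-3
left, right = np.hsplit(grid, [2])   # columns 0-1 and columns 2-3

print(upper)
# [[0 1 2 3]
#  [4 5 6 7]]
print(left)
# [[ 0  1]
#  [ 4  5]
#  [ 8  9]
#  [12 13]]
```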