There is no argument that Python is one of the most popular programming languages for data scientists, and it makes sense. Python, and more specifically the Python Package Index (PyPI), has an impressive number of data science libraries and packages: NumPy, SciPy, Pandas, Matplotlib, and the list goes on and on. Put that together with a massive developer community and a relatively low learning curve (sorry if that last part offends you, it is what it is, get over it), and Python becomes a great choice for data science.
After taking a closer look at some of these libraries, I found out that a lot of them are actually implemented in C and C++ for obvious reasons: better overall performance. They then provide foreign function interfaces (FFIs) or Python bindings so you can call those functions from Python itself. It's no secret that pure Python is not the most performant of programming languages. I don't know the exact numbers, and hey, don't go quoting me on this, but I have heard that in some cases Python can be around 100x slower than C or C++. Anyway, getting back to the point: these lower-level implementations offer better execution time and better memory management. Put those two together and everything becomes more scalable, and therefore cheaper as well. So if you can write more performant code for data science tasks and then integrate it with your Python code, why not?
This is where Rust comes in! The Rust language is known for a lot of things, and based on what I described in the previous paragraph, it aligns quite well with languages like C and C++. In fact, performance-wise Rust is directly comparable with C and C++, and in many ways it is even better: it provides total (or almost total) memory safety, extensive thread safety, and no runtime overhead, which makes it a perfect candidate for data science problems. Lots and lots of data processing.
In this post, my plan is to take a simple task and compare it across three different scenarios:
- Pure Python Code
- Python Code using data science libraries
- Python Code invoking Pure Rust code compiled into a native library
Well, since data science is a very broad subject and I am definitely not an expert on it, I decided to go with a simple data science task: computing the information entropy of a byte sequence. This is the formula for calculating entropy in bits (source: Wikipedia: Entropy):
H(X) = -Σᵢ P_X(xᵢ) log₂ P_X(xᵢ)
In information theory, the entropy of a random variable is the average level of “information”, “surprise”, or “uncertainty” inherent in the variable’s possible outcomes.
Anyway, this is a somewhat simple task, but entropy is a widely used tool in the world of data science and machine learning: it serves as a basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. Anyhow, this is what we are going to do.
Based on our formula ("our formula", haha), to compute the entropy of a random variable X, we first count the occurrences of each possible byte value xᵢ and divide by the total number of bytes to get the probability of that value occurring, P_X(xᵢ). Then we take the negative of the weighted sum over all values: each probability P_X(xᵢ) multiplied by its so-called self-information, log₂ P_X(xᵢ). The log has base 2 because we are working with bits.
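To make the formula concrete, here is a tiny worked example using only the standard library (the four-byte string is made up purely for illustration):

```python
import math

# a tiny byte sequence: 2 occurrences of 0x61 ('a'), 2 of 0x62 ('b')
data = b"aabb"
length = len(data)

# count occurrences of each byte value
counts = {}
for byte in data:
    counts[byte] = counts.get(byte, 0) + 1

# P(x_i) = count / total; H = -sum(P * log2(P))
entropy = -sum(
    (count / length) * math.log2(count / length)
    for count in counts.values()
)
print(entropy)  # two equally likely values -> 1.0 bit
```

Two values, each with probability 0.5, give exactly one bit of entropy, which matches the intuition that the sequence answers one yes/no question per byte.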
I know this is a simplistic assessment, and my goal here is not to attack Python or any of the popular Python data science libraries. The goal is just to see how Rust does against them, even in a simple scenario like this. And who knows what the future might bring?!
In these tests we will compile Rust into a C-compatible library that we can import from Python. All tests were run on macOS Catalina.
Pure Python
We can start by creating a new file called entropy.py where we will have our main code. In this first part we will import the standard library module math and use it to create a new function that calculates the entropy of a bytearray. This function is not optimized in any way and provides a baseline for our performance measurements.
import math


def compute_entropy_pure_python(data):
    """Compute entropy on bytearray `data`."""
    counts = [0] * 256
    entropy = 0.0
    length = len(data)

    for byte in data:
        counts[byte] += 1

    for count in counts:
        if count != 0:
            probability = float(count) / length
            entropy -= probability * math.log(probability, 2)

    return entropy
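As a quick sanity check of this baseline (the function is repeated here so the snippet runs standalone), a sequence containing every byte value exactly once should give the maximum of 8 bits, and a constant sequence should give 0:

```python
import math


def compute_entropy_pure_python(data):
    """Compute entropy on bytearray `data`."""
    counts = [0] * 256
    entropy = 0.0
    length = len(data)
    for byte in data:
        counts[byte] += 1
    for count in counts:
        if count != 0:
            probability = float(count) / length
            entropy -= probability * math.log(probability, 2)
    return entropy


# a uniform distribution over all 256 byte values gives maximum entropy
uniform = compute_entropy_pure_python(bytes(range(256)))
print(uniform)  # ~8.0 bits

# a constant sequence carries no information at all
constant = compute_entropy_pure_python(b"\x00" * 1024)
print(constant)  # 0.0
```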
Python with Data Science Libraries (NumPy and SciPy)
Here we will just continue in the same file as before, entropy.py, and add a couple of imports and another function, this time making use of the libraries we imported. As you can imagine, SciPy already has a function that calculates entropy. We will just use NumPy's bincount() function to calculate the byte frequencies first. To be honest, comparing the performance of SciPy's functions against pure Python is not even fair, but who said that life is fair, so let's keep going.
import numpy as np
from scipy.stats import entropy as scipy_entropy


def compute_entropy_scipy_numpy(data):
    """Compute entropy on bytearray `data` with SciPy and NumPy."""
    counts = np.bincount(bytearray(data), minlength=256)
    return scipy_entropy(counts, base=2)
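One detail worth noting: scipy_entropy accepts raw counts and normalizes them to probabilities internally, which is why we can feed it np.bincount's output directly. Here is a stdlib-only sketch of that normalization step, using made-up counts so it runs without SciPy:

```python
import math

# hypothetical byte counts, e.g. from a 10-byte message
counts = [5, 3, 2]
total = sum(counts)

# normalize counts to probabilities, then apply the entropy formula
probabilities = [c / total for c in counts]
entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
print(round(entropy, 4))
```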
Python with Rust
Now the fun part! Sorry! Just kidding, Python is also fun.
Now we will go step by step through the Rust implementation and what is needed to make Rust work with Python.
The first step is to create a new Rust library project. I did it in the same directory as my entropy.py file, to make things easier.
$ cargo new rust_entropy --lib
This will create a new directory called rust_entropy; the --lib flag tells Cargo to create a library project.
Now we need to make some necessary modifications to our Cargo.toml manifest file.
Cargo.toml
[package]
name = "rust_entropy"
version = "0.1.0"
authors = ["YOUR NAME <YOUR EMAIL>"]
edition = "2018"

[lib]
name = "rust_entropy_lib"
crate-type = ["dylib"]

[dependencies]
cpython = { version = "0.5.2", features = ["extension-module"] }
pyo3 = { version = "0.12.1", features = ["python3"] }
Here we define the library name and crate-type, as well as the dependencies needed to make the Rust code work together with Python: in this case cpython and pyo3, both available on crates.io, the Rust package registry (like NPM, but better!). I used Rust v1.48.0, the latest release available at the time of writing this post.
The Rust implementation is fairly straightforward. Just like in the pure Python version, we initialize an array of counts for each possible byte value and iterate over the data to populate those counts. To finish it off, we calculate and return the negative sum of the probabilities multiplied by the log base 2 of those probabilities.
lib.rs
/// Compute entropy on byte array (pure Rust)
fn compute_entropy_pure_rust(data: &[u8]) -> f64 {
    let mut counts = [0; 256];
    let mut entropy = 0_f64;
    let length = data.len() as f64;

    // collect byte counts
    for &byte in data.iter() {
        counts[usize::from(byte)] += 1;
    }

    // make entropy calculation
    for &count in counts.iter() {
        if count != 0 {
            let probability = f64::from(count) / length;
            entropy -= probability * probability.log2();
        }
    }

    entropy
}
The bulk of the work is done! Now all that is left for us to do is the mechanism to call our pure Rust function from Python.
First we will import what we need from the cpython crate at the top of our lib.rs:
use cpython::{py_fn, py_module_initializer, PyResult, Python};
Next, we include in our lib.rs a CPython-aware function that calls our pure Rust function. This design gives us some separation: we maintain a single pure Rust implementation and provide a CPython-friendly wrapper alongside it.
/// Rust-CPython aware function
fn compute_entropy_cpython(_: Python, data: &[u8]) -> PyResult<f64> {
    // the `Python` token argument guarantees the GIL is already held,
    // so there is no need to acquire it again here
    let entropy = compute_entropy_pure_rust(data);
    Ok(entropy)
}
We also need to use the py_module_initializer! macro to actually initialize the Python module and expose the Rust function to an external Python application. This also goes in our lib.rs.
// initialize Python module and add Rust-CPython aware function
py_module_initializer!(
    rust_entropy_lib,
    initrust_entropy_lib,
    PyInit_rust_entropy_lib,
    |py, m| {
        m.add(py, "__doc__", "Entropy module implemented in Rust")?;
        m.add(
            py,
            "compute_entropy_cpython",
            py_fn!(py, compute_entropy_cpython(data: &[u8])),
        )?;
        Ok(())
    }
);
Now let's compile this code and generate a library so we can use it from our Python code.
$ cargo build --release
If you are on macOS like I was, you will need to create a file called config inside a directory called .cargo (which you may also need to create) in your Rust project, with the following content:
[target.x86_64-apple-darwin]
rustflags = [
"-C", "link-arg=-undefined",
"-C", "link-arg=dynamic_lookup",
]
This will generate a file called librust_entropy_lib.dylib inside the ./target/release directory. To make things easier, copy this file to where your entropy.py file is and rename it to rust_entropy_lib.so.
Calling our Rust Code from Python
Now it's time to finally call our Rust implementation from Python, in our entropy.py file again. The first thing to do is to import our newly created library at the top of entropy.py.
import rust_entropy_lib
Then all we have to do is call the exported function we declared earlier, when we initialized the Python module with the py_module_initializer! macro in our Rust code. Again, in our entropy.py file:
def compute_entropy_rust_from_python(data):
    """Compute entropy on bytearray `data` with Rust."""
    return rust_entropy_lib.compute_entropy_cpython(data)
At this point, we have a single Python module that includes functions to call all of our entropy calculation implementations.
We measured the execution time of each implementation with pytest-benchmark, computing entropy over 1 million random bytes. All implementations were presented with the same data. The benchmark tests (also included in entropy.py) are shown below; with the pytest-benchmark plugin installed, they run with a plain pytest entropy.py.
# ### BENCHMARKS ###

# generate some random bytes to test w/ NumPy
NUM = 1000000
VAL = np.random.randint(0, 256, size=(NUM,), dtype=np.uint8)


def test_pure_python(benchmark):
    """Test pure Python."""
    benchmark(compute_entropy_pure_python, VAL)


def test_python_scipy_numpy(benchmark):
    """Test Python with SciPy and NumPy."""
    benchmark(compute_entropy_scipy_numpy, VAL)


def test_rust(benchmark):
    """Test Rust implementation called from Python."""
    benchmark(compute_entropy_rust_from_python, VAL)
And for a different scenario, I made a separate script for each method of calculating entropy and added them to the root of the project, at the same directory level as our entropy.py.
entropy_pure_python.py
import entropy

# test.img is a 10 MB binary file of zero bytes, generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
with open('test.img', 'rb') as f:
    DATA = f.read()

# repeat the calculation 100 times, for our pure Python method
for _ in range(100):
    entropy.compute_entropy_pure_python(DATA)
entropy_python_data_science.py
import entropy

# test.img is a 10 MB binary file of zero bytes, generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
with open('test.img', 'rb') as f:
    DATA = f.read()

# repeat the calculation 100 times, for our NumPy/SciPy method
for _ in range(100):
    entropy.compute_entropy_scipy_numpy(DATA)
entropy_rust.py
import entropy

# test.img is a 10 MB binary file of zero bytes, generated with
# dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
with open('test.img', 'rb') as f:
    DATA = f.read()

# repeat the calculation 100 times, for our Rust method
for _ in range(100):
    entropy.compute_entropy_rust_from_python(DATA)
The test.img file is created with the following command. Note that since dd reads from /dev/zero with count=0 and only seeks, this actually produces a 10 MB sparse file of zero bytes rather than random data; for these timing runs only the file size matters.

dd if=/dev/zero of=test.img bs=1024 count=0 seek=$[1024*10]
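If you would rather benchmark against genuinely random bytes, you can generate a file of the same size from Python's standard library instead (a sketch; it reuses the test.img filename from above):

```python
import os

SIZE = 10 * 1024 * 1024  # 10 MB, matching the dd command

# os.urandom gives cryptographically random bytes, so every byte
# value should appear with roughly equal frequency (entropy near 8 bits)
with open('test.img', 'wb') as f:
    f.write(os.urandom(SIZE))

print(os.path.getsize('test.img'))  # 10485760
```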
And each script repeats the calculation 100 times in order to simplify capturing memory usage data.
Script Results:
# entropy_pure_python.py
$ gtime python entropy_pure_python.py
74.70user 0.64system 1:13.92elapsed 101%CPU (0avgtext+0avgdata 60180maxresident)k
0inputs+0outputs (436major+14770minor)pagefaults 0swaps

# entropy_python_data_science.py
$ gtime python entropy_python_data_science.py
5.61user 1.15system 0:05.37elapsed 126%CPU (0avgtext+0avgdata 151896maxresident)k
0inputs+0outputs (2074major+36061minor)pagefaults 0swaps

# entropy_rust.py
$ gtime python entropy_rust.py
3.01user 0.53system 0:02.06elapsed 171%CPU (0avgtext+0avgdata 60104maxresident)k
0inputs+0outputs (2074major+13115minor)pagefaults 0swaps
I used the GNU time application (gtime) to measure the performance of the scripts above.