In real-life situations, you might need high computational power to execute some tasks. A great example is training a machine learning model or a neural network, both of which are intensive and time-consuming processes.
If you are interested in reading more about multiprocessing, Brendan Fortuner wrote a great article about threads and processes in Python. You can check it out here.
One route you might consider is distributing the training task over several processes using pathos, a fork of Python's multiprocessing module. However, such tasks often can't be serialized with the standard pickle module and raise a pickling error. That's because when a single task is divided across multiple processes, those processes may need to share data, yet they don't share memory space, so every object passed between them must be serialized.
In this situation, the dill package comes in handy: it can serialize many types of objects that aren't picklable, including database connections, lambda functions, running threads, and more.
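To make the contrast concrete, here is a minimal sketch (assuming dill is installed, e.g. via pip install dill): the standard library's pickle fails on a lambda, while dill round-trips it without complaint.

```python
import pickle

import dill

square = lambda x: x ** 2

# The standard library pickle can't serialize a lambda and raises an error
try:
    pickle.dumps(square)
except Exception as error:
    print("pickle failed:", error)

# dill serializes the function by value, so it survives a round trip
payload = dill.dumps(square)
restored = dill.loads(payload)
print(restored(5))  # -> 25
```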
dill is typically slower, but that's the penalty you pay for more robust serialization. If you are serializing a lot of classes and functions, then you might want to try one of the dill variants in dill.settings. If you use byref=True then dill will pickle several objects by reference (which is faster than the default). Other settings trade off picklability for speed in selected objects. — Mike McKerns (dill author's answer on Stack Overflow, dill vs. cPickle)
To make sense of it, let’s have an example.
>>> import dill
>>> from pathos.multiprocessing import ProcessingPool
# set up a processing pool with 4 cores
>>> pool = ProcessingPool(nodes=4)
# do some processing
>>> result = pool.map(lambda x: x**2, range(10))
# save the lambda as a pickle file using dill
>>> dill.dump(lambda x: x**2, open('use_dill', 'wb'))
# save the session as a pickle file
>>> dill.dump_session('dill_session.pkl')
The dump_session() method serializes the entire state of the interpreter session.
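Independently of the full session dump, the single function saved to 'use_dill' above can be read back with dill.load. A minimal round trip (re-creating the file so the sketch is self-contained) looks like this:

```python
import dill

# Write the lambda out, as in the snippet above
with open('use_dill', 'wb') as f:
    dill.dump(lambda x: x ** 2, f)

# Read it back into a fresh name and call it
with open('use_dill', 'rb') as f:
    square = dill.load(f)

print(square(4))  # -> 16
```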
Now, restart the Python interpreter and inspect the global environment:
>>> globals().items()
It should return something similar to the following, which is the interpreter's initial state:
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>)])
Next, try to reload the dill session you saved earlier by:
>>> import dill
>>> dill.load_session('dill_session.pkl')
>>> globals().items()
The output should be:
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>), ('dill', <module 'dill' from '/Users/salmaelshahawy/.pyenv/versions/3.8.6/envs/venv38/lib/python3.8/site-packages/dill/__init__.py'>), ('ProcessingPool', <class 'pathos.multiprocessing.ProcessPool'>), ('pool', <pool ProcessPool(ncpus=4)>), ('result', [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]), ('__warningregistry__', {'version': 0})])