Yes, but hear me out!
Deep learning has become synonymous with data science and an inextricable element of machine learning. It's shaped how humans interact with machines perhaps more than any other advance in mathematical modeling to date. With self-driving cars, computers triumphing over masters at their respective table games, and language translation accessible via any smartphone across the globe, it's natural to wonder: what can't deep learning do?
Deep learning has turned our world upside down; but along the way, it has learned and amplified the societal biases lurking in the truly massive datasets required to train such models. Correspondingly, criticism of this algorithmic bias has resulted in highly publicized employee dismissals. The primary call to action in said criticism has been some variation of:
The model, the data, and the underlying biases therein cannot be disentangled and this information ought not to be overlooked in the name of deep learning product adoption. [My own summary.]
Researchers, critical of state-of-the-art model performance on benchmark question-answering datasets, noted that performance dropped by over 20% when small changes to the question/answer structure were introduced; in other words, spurious associations between words, not a genuine understanding of language, enabled the model's incredible performance. This fact frames a central issue in deep learning research: advances and achievements in deep learning are measured by benchmark dataset performance, and there's a twofold financial incentive to continue benchmark-driven research and development: (A) it's very costly to create a dataset for training (supervised) deep learning models, and (B) benchmark dataset performance is central to AI's value proposition: it articulates progress, which means monetary value, which means an effect on <insert big tech company>'s valuation. (I'll circle back later to one more reason why big tech is incentivized to propagate widespread deep learning enthusiasm.)
Deep learning is effectively black-box pattern matching. The models easily overfit on small datasets but generalize well as the sample size approaches the population size. Deep learning models don't think (despite the name); they simply map inputs to outputs: faces in photos, subjects in sentences, etc. The model architectures change every couple of years and aren't accompanied by rigorous mathematical proofs validating why a given design is the optimal configuration; rather, they're updated in an attempt to squeeze out a meaningful nudge to some performance metric of the user's choice. We often call such an approach a heuristic method, meaning:
any approach to problem solving … that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, short-term goal or approximation.
Heuristic methods are a wise choice when it's infeasible to account for (or even approximate) the full complexity of a system, especially when we have no hypotheses about how the system behaves. As a heuristic method, deep learning performs exceptionally well on unstructured data tasks, such as facial recognition, because we simply have no means of articulating every conceivable pattern of pixel configurations that could define what a face is or isn't. In such a context, a highly flexible, black-box pattern-matching machine is very desirable.
However, there are countless problems where we do have the means to propose hypotheses about how the system works, and where we need not only accurate results but also confidence in our predictions and a deeper understanding of the system itself. For example, a hierarchical/multilevel Bayesian model developed by Google researchers in 2017 can be used to understand the effect of marketing channel spend on sales, including the delay until an investment has its peak effect, the saturation point beyond which further investment yields diminishing returns, and the strength of each channel's effect on sales. A deep learning model, by contrast, can only make predictions; we can't take its five million parameters and walk away with an increased understanding of how the system actually behaves.
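To make that concrete, here is a minimal sketch (not the Google model itself) of how such a marketing-mix model might look in PyMC3: one hypothetical channel, a saturation parameter for diminishing returns, and a per-channel effect on sales. The data, the priors, and the omission of the carryover/adstock term are all simplifying assumptions of mine.

```python
import numpy as np
import pymc3 as pm

# Hypothetical weekly data: spend on one marketing channel and observed sales.
weeks = 104
rng = np.random.default_rng(42)
spend = rng.gamma(2.0, 50.0, size=weeks)
sales = 200 + 120 * (1 - np.exp(-spend / 80)) + rng.normal(0, 20, size=weeks)

with pm.Model() as marketing_model:
    # Saturation scale: how quickly additional spend hits diminishing returns.
    saturation = pm.Gamma("saturation", alpha=2.0, beta=0.02)
    # Strength of this channel's effect on sales, constrained to be positive.
    channel_effect = pm.HalfNormal("channel_effect", sigma=200)
    # Baseline sales with zero marketing spend, plus unexplained noise.
    baseline = pm.Normal("baseline", mu=200, sigma=100)
    noise = pm.HalfNormal("noise", sigma=50)

    # Exponential saturation curve; the published model also includes a
    # carryover (adstock) term for delayed effects, omitted here for brevity.
    expected_sales = baseline + channel_effect * (1 - pm.math.exp(-spend / saturation))
    pm.Normal("sales", mu=expected_sales, sigma=noise, observed=sales)

    trace = pm.sample(1000, tune=1000, target_accept=0.9)

# Each posterior is a distribution over a parameter with a real-world meaning,
# which is exactly the kind of understanding a black-box model can't hand back.
print(pm.summary(trace))
```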
The blessing and curse of Bayesian methods is that they put the user in the driver's seat. You have to supply beliefs about how your system works via parameters and the probability distributions that describe them; in other words, you fit the model to the data. By contrast, black-box methods have the user squeeze the data into a shape appropriate for a well-studied task, such as classification, regression, or clustering. It's true that a knowledgeable user of neural network libraries (PyTorch, TensorFlow/Keras, etc.) can mold a deep learning architecture to the task at hand; however, the effectiveness of that architecture is judged entirely by performance, and adjustments are made by tweaking the nuts and bolts of the machine until an arbitrary satisfaction threshold is reached. In other words, these adjustments are not informed by knowledge of the system you seek to model.
This brings us to the real value proposition of cloud-based machine learning: both Google and Amazon have their respective out-of-the-box, cloud-based, deep-learning-driven machine learning products. Amazon Web Services' SageMaker allows for plug-and-play machine learning algorithms, such as object detection, without imposing any mathematical knowledge requirements on the user. The justification for adopting said products is their truly remarkable accuracy on benchmark tasks. However, if you need to improve your human understanding of how your system works, these tools will be of little help to you. Thus tech giants are incentivized to propagate the belief that deep learning is perfect for any conceivable task, because that belief facilitates product adoption. (I promised I'd circle back to this.)
An issue I've run into when working with business domain experts is watching their expectations deflate when they're hit with the realization that deep learning is only exceptionally good at benchmark tasks, and simply pretty good at other tasks when the available data is very large. For example, state-of-the-art named entity recognition (NER) models can detect people, places, organizations, dates, currencies (and often more) in text. That doesn't guarantee that extracting skills from resumes will be a trivial task: you still need a large volume of training data, and this is costly to acquire. The problem is so well established that Amazon Mechanical Turk will crowdsource the data tagging (the training-dataset preparation process) for you, relying on human labor wherever it's affordable.
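To illustrate the gap, here's a small sketch using spaCy's pretrained English pipeline (my choice of library, not anything prescribed above): the off-the-shelf model happily tags people, organizations, places, and dates, but its label set has no notion of a "skill," so a resume-skills extractor still needs its own tagged training data.

```python
import spacy

# Assumes the small English pipeline has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "Jane Doe worked at Acme Corp in Berlin from 2018 to 2021, "
    "building data pipelines in Python and Spark."
)

# The pretrained model recognizes benchmark entity types out of the box...
for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. PERSON, ORG, GPE, DATE

# ...but there is no SKILL label. Teaching a model to tag "Python" and
# "Spark" as skills means annotating a large corpus of resumes yourself.
```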
A common mitigation of the novel-task problem for deep learning is the use of transfer learning; in the context of natural language processing, this takes the form of word embeddings. Think of transfer learning as a coupon at a restaurant: "free wings with the order of any pizza." The wings are an appetizer, filling you up a bit so you can get away with ordering only a small or medium pizza instead of a large (or extra large). You're still paying for a pizza; it just doesn't cost as much. The data ingestion process is like the pizza: sure, you need less, but you certainly still need some. So instead of millions of observations, you might get away with hundreds of thousands of rows, thanks to your transfer learning (the buffalo wings in this metaphor). It's not realistic to tag the skills in a hundred resumes over one afternoon, or even a couple thousand over a week, and expect transfer learning to do the rest.
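A minimal sketch of that trade-off, assuming pretrained GloVe vectors via gensim and a scikit-learn classifier (my own illustrative tooling): the embeddings come for free, but the labeled rows, of which you'd still need thousands in practice, do not.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# The "free wings": pretrained GloVe word vectors, learned on a huge corpus
# someone else already paid for.
vectors = api.load("glove-wiki-gigaword-100")

def featurize(text):
    # Average the vectors of recognized tokens (a crude but common baseline).
    tokens = [t for t in text.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0) if tokens else np.zeros(100)

# The "pizza" you still have to buy: labeled examples. Two rows is obviously a
# toy; in practice you still need thousands of tagged sentences, just not millions.
texts = ["proficient in python and sql", "managed a team of accountants"]
labels = [1, 0]  # hypothetical labels: 1 = mentions a technical skill

clf = LogisticRegression().fit([featurize(t) for t in texts], labels)
print(clf.predict([featurize("experience with java and spark")]))
```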
One exception I've omitted is reinforcement learning, the juice behind self-driving cars and table game mastery. This form of machine learning (and, in these two examples, deep learning) does not require a training dataset. It does, however, require an environment that can reward or punish the actions taken by an agent. A business system framed as a game appropriate for reinforcement learning is formally referred to as a Markov decision process, and that framing demands some mathematical knowledge of the user (it's just as complex as Bayesian methods); it isn't likely to become an out-of-the-box offering from cloud-based machine learning providers in the near term.
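As a concrete picture of what "an environment instead of a dataset" means, here's a minimal agent/environment loop using OpenAI Gym's CartPole (my example, written against the classic pre-0.26 Gym API); a real system would replace the random policy with a learned agent and CartPole with a carefully specified model of the business process.

```python
import gym

# No training dataset: the environment itself supplies observations and rewards.
env = gym.make("CartPole-v1")
obs = env.reset()
total_reward = 0.0

for _ in range(200):
    action = env.action_space.sample()          # random policy, standing in for a learned agent
    obs, reward, done, info = env.step(action)  # the environment rewards/punishes the action
    total_reward += reward
    if done:
        break

print("episode reward:", total_reward)
env.close()
```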
A second exception I've omitted is Bayesian neural networks. This design allows for the incorporation of prior beliefs, which, in the deep learning context, regularize the values that given parameters can take on. Correspondingly, the method is less likely to overfit on small data volumes (a serious improvement) and, most importantly, returns distributions over its predictions: not just what it believes but how confident it is, too. But this is a blessing and a curse: the larger the network, the longer the training time required by the typical Bayesian fitting method, MCMC sampling. The issue is troublesome enough that an alternative fitting approach is the subject of active research: variational inference. This method posits a simple, parameterized approximating distribution and iteratively minimizes the divergence (typically the Kullback-Leibler divergence) between it and the target posterior. Various tools in the Bayesian statistics community, such as PyMC3, have an implementation available. However, it hasn't been studied nearly as long as MCMC sampling, so we're not yet sure what pitfalls it's susceptible to or how to mitigate those risks. With these constraints in mind, it takes considerable study on the part of the user to gainfully employ a Bayesian neural network in practice. As a consequence, they're not likely to appear in out-of-the-box cloud ML offerings in the near term.
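For the curious, here's a minimal sketch of a tiny Bayesian neural network fit with variational inference (ADVI) in PyMC3; the toy data, the architecture, and the priors are all my own illustrative assumptions.

```python
import numpy as np
import pymc3 as pm

# Toy binary-classification data: two features, a simple decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype("float64")

n_hidden = 5
with pm.Model() as bnn:
    # Priors over the weights act as regularization and give us uncertainty.
    w_in = pm.Normal("w_in", mu=0, sigma=1, shape=(2, n_hidden))
    w_out = pm.Normal("w_out", mu=0, sigma=1, shape=(n_hidden,))

    hidden = pm.math.tanh(pm.math.dot(X, w_in))
    p = pm.math.sigmoid(pm.math.dot(hidden, w_out))
    pm.Bernoulli("obs", p=p, observed=y)

    # Variational inference (ADVI) instead of MCMC: fit an approximate
    # posterior, then draw samples from it for predictions with uncertainty.
    approx = pm.fit(n=20000, method="advi")
    trace = approx.sample(1000)
```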
Speaking of Bayesian methods: anytime you need to understand how your system actually behaves, your go-to should be Bayesian statistics. These days, high-level libraries like PyMC3 abstract away a lot of the low-level details while allowing you to retain control over what your parameters actually model. I highly recommend the book Statistical Rethinking (2nd edition) by Richard McElreath, the corresponding YouTube lecture series, and the book's code, ported to PyMC3, on GitHub.
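To show what "retaining control over what your parameters model" looks like, here's a deliberately simple PyMC3 regression on made-up data; every parameter name, prior, and number below is my own illustration, not anything from the book.

```python
import numpy as np
import pymc3 as pm

# Made-up data: hours studied vs. exam score.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
score = np.array([52, 55, 61, 64, 70, 72, 78, 83], dtype=float)

with pm.Model() as exam_model:
    # Each parameter has a plain-language meaning that you chose yourself.
    baseline = pm.Normal("baseline", mu=50, sigma=10)   # expected score with no study
    per_hour = pm.Normal("per_hour", mu=3, sigma=3)     # points gained per hour of study
    noise = pm.HalfNormal("noise", sigma=10)            # unexplained variation

    pm.Normal("observed_score", mu=baseline + per_hour * hours,
              sigma=noise, observed=score)
    trace = pm.sample(1000, tune=1000)

# The posterior summary answers questions about the system, not just predictions.
print(pm.summary(trace))
```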
I don't mean to disparage cloud ML solutions in general, but rather to cast doubt on (A) the assumption that benchmark DL tasks are widely applicable and (B) the ease of ingesting enough data for DL performance on a novel task. Cloud ML offerings such as AWS SageMaker let you upload a Docker image, complete with the specific library requirements for your task, greatly extending your modeling choices beyond benchmark (DL) tasks.
In summary, deep learning is incredible at well-researched tasks, thanks to the enormous benchmark datasets used; at the same time, critics of algorithmic bias advocate for diverse research teams reviewing the data ingestion process, which adds cost to an already costly process. And performance expectations for less-researched tasks need to be calibrated against the very real fact that data ingestion is costly and slow, perhaps so much so that we shouldn't rule out methods that don't alienate the human being from the model. In other words, you haven't been sold a lie, but you have been sold a slick marketing campaign, one that sings to the strengths but mutes the weaknesses of deep learning as a solution to all your data needs.
I’ll leave you with Plato’s caution against writing:
They will cease to exercise memory because they rely on that which is written, calling things to remembrance no longer from within themselves, but by means of external marks.
In short: we don't write to remember; we write to forget. The invention of written language had an unintended consequence: by extending our access to information, we accidentally relieved ourselves of the obligation to remember it. Of course, written language has unambiguously improved society; nonetheless, it had profound consequences for the Greek oral tradition. This example illustrates that changes in how we encode information have hidden consequences, affecting our ability to access and manipulate that very information.
Certainly, self-driving cars and language translation models justify the extensive research deep learning has received. But with widespread adoption of the belief that any problem can be solved by deep learning, we are inadvertently curbing the value of understanding the systems we wish to model and the powerful techniques therein.