“Cheese” and Smiles
My introduction to face attribute classification came from Christian Moeller’s Cheese (2003), in which a sequence of actresses tries to hold a smile for as long as possible while a computer judges the quality of their smiles. Christian writes, “The performance of sincerity is hard work.”
Face attribute classification stands in contrast to the more familiar technologies of face detection (drawing a box around a face), face landmark detection (finding key points like eyes, ears, or lips), and face recognition (matching a name to a face, but sometimes used as a catch-all term for all face analysis).
Identifying whether someone is smiling or not is treated as one “attribute”: a class, label, trait, or description of a face. A single, irreducible dimension, a binary: SMILING. But no attribute exists in isolation. And many, like facial expressions, have complex histories. Modern face expression classification has its origins in the late 60s, when American psychologist Paul Ekman developed the Facial Action Coding System. He broke expressions down into their constituent muscle movements, and tied them to emotions based on his personal intuition and a now-refuted hypothesis that emotions are expressed the same way across all cultures.
Facework starts with SMILING as the first job because it is something most people can perform easily. But it’s also one of the first face attributes I personally worked with. My earliest work on face analysis was building Portrait Machine (2009) with Theo Watson. We analyzed everything we could: hair color, skin color, clothing color, face size, head tilt, sunglasses, etc. It was a very manual process full of heuristics, like color averages and carefully set thresholds. We had to write new code for every new attribute. Something like a smile seemed impenetrable.
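A toy sketch of what this kind of hand-written heuristic looked like, in Python. The attribute, region, and threshold here are illustrative inventions, not values from Portrait Machine:

```python
# Illustrative heuristic attribute classifier: average the pixel colors in
# one region of the face and compare against a hand-tuned threshold.
# The attribute (SUNGLASSES), the region, and the threshold are hypothetical.

def mean_brightness(pixels):
    """Average brightness (0-255) of a list of (r, g, b) tuples."""
    return sum((r + g + b) / 3 for r, g, b in pixels) / len(pixels)

def looks_like_sunglasses(eye_region, threshold=60):
    """Guess SUNGLASSES when the eye region is darker than the threshold."""
    return mean_brightness(eye_region) < threshold

# A very dark eye region reads as sunglasses; a brighter one does not.
dark_eyes = [(20, 20, 25)] * 100
bright_eyes = [(180, 150, 140)] * 100
```

Every new attribute meant another function like this, with its own hand-tuned constants, which is exactly why something as subtle as a smile felt out of reach.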
But with new machine learning techniques it is possible to make a prediction based on “examples” (training data) instead of “explanations” (hard-coded heuristics). In 2010 Theo found some machine learning-based smile detection code from the Machine Perception Laboratory at UCSD (the same group who worked with Christian on Cheese). Theo built an app that inserts a smile emoji whenever you smile. I extended Theo’s work to share a screenshot whenever you smile.
Detecting smiles can feel innocuous and playful. But as Cheese shows: try holding that smile for more than a few moments, and it’s apparent that something is very wrong. Why are we asking a computer to judge us in the first place? And who is putting that judgement to use? Does the meaning of an expression change when we’re being judged by a machine instead of another human?
“us+” and “Vibe Check”
In 2013 I worked with Lauren McCarthy to build us+, a video chat plugin that analyzes speech and facial expressions and gives feedback to “improve” conversations. How might the future feel if every expression, every utterance, was automatically analyzed? Can it be affirming or helpful to get automated advice, or is this future nothing but a dystopia to avoid? Are there some things that only humans should ever judge, and machines should always avoid?
In 2020 we created Vibe Check, a face recognition and expression classification system to catalog the emotional effect that exhibition visitors have on one another. Some visitors are identified as consistently evoking expressions of happiness, disgust, sadness, surprise, or boredom in others nearby. Upon entering the exhibition, visitors are alerted to who these people are, and as they leave, they may find they’ve earned this distinction themselves and found a place on the leaderboard.
With us+ and Vibe Check we are not expressing broad pessimism in response to this tech. Maybe we can find some place for it, if we build it consensually and reflect on it together? Maybe there is someone out there who could use the feedback from a machine instead of a human, or someone who needs help identifying face expressions? We try to create a space for that reflection.
Without reflection, we get companies like HireVue, who sell software that analyzes speech and facial expressions to estimate job interview quality, leading to complaints to the Federal Trade Commission.
In late 2019 Microsoft researcher Kate Crawford and artist Trevor Paglen published Excavating AI. Their essay examines the impossibility of classification itself: classification is typically framed as a technical problem, but they break it down to its political and social foundations.
This essay was preceded by “an experiment in classification” called ImageNet Roulette, developed by Leif Ryge for Trevor’s studio: upload a photo to the website, and it returns a copy with a box around you and a label. They write:
The ImageNet dataset is typically used for object recognition. But […] we were interested to see what would happen if we trained an AI model exclusively on its “person” categories. […] ImageNet contains a number of problematic, offensive, and bizarre categories. Hence, the results ImageNet Roulette returns often draw upon those categories. That is by design: we want to shed light on what happens when technical systems are trained using problematic training data.
In some ways this artistic “what would happen if” process mirrors the way some face classification research is carried out.

Most or all of this work hopes to create a teachable moment: Kate and Trevor wanted to show the dangers of “problematic training data,” and of classification more broadly; Michal Kosinski and Yilun Wang wanted to “expose a threat to the privacy and safety of gay men and women”; Xiaolin Wu and Xi Zhang wrote a defense of their work saying they believe in the “importance of policing AI research for the general good of the society.”
When does the opportunity for discussion come at the expense of retraumatizing others (“The viral selfie app ImageNet Roulette seemed fun — until it called me a racist slur”)? And when a work comments on the potential dangers of a technology, at what point is that commentary outweighed by the potential for the work itself to be misused or misinterpreted?
Shu Lea Cheang pushes back differently in her work 3x3x6 (2019), where visitors to the installation send selfies to be “transformed by a computational system designed to trans-gender and trans-racialize facial data.” Instead of simply handing visitors a reflection of the machine’s gaze, the machine is actively repurposed to blur the categories into a queered heterogeneous mass. Paul Preciado connects Shu Lea’s work back to Michal and Yilun’s research:
But if machine vision can guess sexual orientation it is not because sexual identity is a natural feature to be read. It is because the machine works with the same visual and epistemological regime that constructs the differences between heterosexuality and homosexuality: We are neither homosexual nor heterosexual but our visual epistemologies are; we are neither white nor black but we are teaching our machines the language of technopatriarchal binarism and racism.
Most of the jobs in Facework are trained using a dataset called Labeled Faces in the Wild Attributes+ by Ziwei Liu et al. (with the exception of WEARING MASK, POLICE, and CEO). LFWA+ is not as well known as the larger CelebFaces Attributes Dataset (CelebA), though both were released with the paper Deep Learning Face Attributes in the Wild (2015). While CelebA consists of 200k images of celebrities with 40 attributes per image, LFWA+ is a smaller 18k images with 73 attributes each. Both include attributes like BIG LIPS, BUSHY EYEBROWS, and DOUBLE CHIN. LFWA+ adds four racial groups, SUNGLASSES, CURLY HAIR, and more.
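These attributes ship as plain text annotations. A minimal sketch of parsing a CelebA-style file, where each row is an image name followed by a 1 or -1 per attribute (simplified to a small inline example; the published files also begin with an image count):

```python
# Parse CelebA/LFWA+-style attribute annotations: a header row of
# attribute names, then one row per image of +1/-1 flags.

def parse_attributes(lines):
    names = lines[0].split()
    table = {}
    for row in lines[1:]:
        image, *flags = row.split()
        table[image] = {n: int(v) == 1 for n, v in zip(names, flags)}
    return table

sample = [
    "Smiling Big_Lips Double_Chin",
    "000001.jpg  1 -1 -1",
    "000002.jpg -1  1 -1",
]
labels = parse_attributes(sample)
```

A face is reduced to a row of flags: SMILING sits in the same table, with the same encoding, as every other category.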
The logic or ontology of these categories is unclear, and the provenance of the labels is hard to trace. The accompanying paper only briefly mentions how the LFWA+ data was created. Based on my sleuthing, it seems to be related to an earlier dataset called FaceTracer. My impression is that the researchers were just trying to plug in anything they could get their hands on. One lesson seems to be: if the data exists, someone will find a use for it — appropriate or not. Some researchers critiqued ImageNet Roulette as a completely inappropriate use of the ImageNet labels, saying that no one had ever trained a classifier on the “person” category. But appropriating data regardless of origin or original intent seems to be pretty standard in machine learning. What classifier will be trained next? Here are some other labels I have seen describing people in datasets:
HEIGHT, WEIGHT, EYE COLOR, HAIR COLOR, BUST SIZE, WAIST SIZE. BLACK, WHITE, ASIAN, INDIAN. EUROPEAN, AMERICAN, AFRICAN, CHINESE. AMERICAN INDIAN OR PACIFIC ISLANDER, ASIAN OR PACIFIC ISLANDER. SPORTS COACH TRAINER, PHOTOGRAPHER, VOTER, ANGLER, PARACHUTIST, BULLFIGHTER, JOCKEY, LABORER, SURGEON, WAITRESS, STUDENT, SPORTS FAN. SOLDIER FIRING, SOLDIER PATROLLING, SOLDIER DRILLING. FAMOUS. FAMILY. ATYPICAL, AGGRESSIVE, BORING, CARING, COLD, CONFIDENT, EGOTISTIC, EMOTIONALLY UNSTABLE, FORGETTABLE, HUMBLE, INTELLIGENT, INTROVERTED, KIND, MEMORABLE, RESPONSIBLE, TRUSTWORTHY, UNFAMILIAR, UNFRIENDLY, WEIRD. NORMAL.
Mushon Zer-Aviv tackles this final category of “Normal” in The Normalizing Machine (2019), where visitors are asked to point at the previous visitor’s portrait that looks “most normal.” Mary Flanagan addresses the absurdity of these labels in [help me know the truth] (2016), where visitors are asked which of two random variations of their own selfie best exemplifies a ridiculous label, with questions like “Which is a banana?” or “Which is a martyr?”
In response to these and other labels I think of critic and writer Nora Khan, who encourages us to ask the same questions about “naming” that we ask in the arts:
Is what I’m seeing justifiably named this way? What frame has it been given? Who decided on this frame? What reasons do they have to frame it this way? Is their frame valid, and why? What assumptions about this subject are they relying upon? What interest does this naming serve?
In order to understand this research better, I try to replicate it. It’s easy to critique face analysis systems as fundamentally flawed for collecting images non-consensually, or for trying to fit the boundless potpourri of human expression into Paul Ekman’s seven categories. But when I train these systems I get to discover all the other flaws: the messy details that the researchers don’t write about.¹
Replicating research gives me insight that I can’t get from using a toolkit, reading a paper, or studying historical precedents and theory. It helps put me in the mindset of a researcher. I have to solve some of the same problems, and sit for an extended time with the assumption that the categories and data in question are valid. I get to let the data stare back at me. Every dataset has a first row, and there are always one or two faces I see over and over while debugging code.
“Who Goes There”
Some face attributes seem mundane: BANGS, SHINY SKIN, TEETH NOT VISIBLE. Others are a matter of life and death, like research on classifying Uyghur people, which surveillance companies sell for use in western China, where one million Uyghur people are sent for “re-education” each year.
I think this is one reason that some of the most detailed work on predicting race is framed as a completely different problem: “photo geo-localization.” In Who Goes There (2016), Zach Bessinger et al. build on the previous work GeoFaces (2015) by Nathan Jacobs et al. to predict the geographic location of a photo based on the faces in that photo. This is a common sleight of hand: researchers rarely have the data they want, so they find a proxy. If they can’t find a large collection of photos with self-identified racial identities for each face, they use the GPS tags from 2 million Flickr photos.
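The proxy step can be sketched in a few lines: coarse bounding boxes turn a GPS tag into a region label, and that label becomes the training target. These boxes are crude illustrations of the idea, not the subregions the papers actually use:

```python
# Hypothetical GPS-to-region binning: latitude/longitude bounding boxes
# assign each geotagged photo a coarse region label. Real geo-localization
# work uses far finer subregions; these two boxes are only illustrative.

def region_for(lat, lon):
    if -35 <= lat <= 37 and -18 <= lon <= 52:
        return "africa"
    if 7 <= lat <= 72 and -168 <= lon <= -52:
        return "north_america"
    return "other"
```

Every face in a photo inherits this label, whether the person pictured lives there or is just passing through.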
In replicating Zach’s work I discovered how this sleight of hand can have subtle, unexpected outcomes. After achieving similar accuracy to the original model, I went through the data to find examples of photos that looked “most representative” of each of the geographic subregions. I was curious who the model predicted to be “the most North American” or “the most East Asian,” and whether this matched my personal biases or any other stereotypes. This is how I found JB.
For most regions of Africa, the model picked a variety of people with darker skin as being most representative. But in Central Africa, a massive region including around 170M people, it picked dozens of photos of only one guy with much lighter skin. I thought it was a bug at first, but I couldn’t find the mistake in my code. So I traced the faces back to Flickr and found him.
I’m JB and I completed a 26600 KM trip across Africa via the West coast, from Zürich to Cape Town, between 2012 and 2014. This website relates this wonderful and strenuous experience in geotagged posts, thousands of photos…
To the classifier, JB’s face was so consistent and useful for prediction that it sort of fell in love, prioritizing him above anyone else in the area. No one else looked as similar to other folks in the region as JB looks to himself. If we zoom in on the researchers’ map we can actually find JB around Gabon.
Even after making a small change to allow only one example per face, I saw new problems: hundreds of people photographing the same celebrity, or statue, or artwork.
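That dedup step can be sketched like this; the identity keys are hypothetical, standing in for whatever a Flickr user ID or a face-matching step would provide:

```python
# Keep at most one training example per identity, so that one prolific,
# frequently photographed person can't dominate an entire region.

def dedup_by_identity(examples):
    """examples: iterable of (identity, region_label) pairs."""
    seen = set()
    kept = []
    for identity, region in examples:
        if identity not in seen:
            seen.add(identity)
            kept.append((identity, region))
    return kept

# Thirty photos of one traveler collapse to a single example.
data = [("traveler", "central_africa")] * 30 + [("local", "central_africa")]
```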
Discovering JB helped me think about how we use proxies in research and daily life. In Who Goes There, appearance is a proxy for geolocation (or geolocation is a proxy for race). In everyday interaction, expression is a proxy for emotion. There is a place for some proxies: they can be an opportunity to communicate, to understand, to find your people. But they can also be a mechanism for prejudice and bias: gender-normative beliefs might treat clothing as a proxy for gender identity, or racist beliefs might treat skin color as a proxy for racial identity. In the end there is no substitute for someone telling you who they are. And no machine learning system can predict something that can only be intentionally communicated.
Facework is massively inspired by face filters and face-responsive apps. This includes popular face swap apps like Face Swap Live and Snapchat from 2015, which can be traced to work I made with Arturo Castro in 2011 (and to earlier experiments by Parag Mital). But Facework is more directly connected to face filters that change your age and gender, your “hotness,” or your race. The same datasets and algorithms that drive face attribute classification also drive face attribute modification.
So it makes sense that when researchers collect expressions, accessories, age, gender, and race under one generic umbrella as “face attributes,” that this uniform understanding of facial appearance is transferred to face filter apps. When the dataset treats SMILING, OLD, and ASIAN the same, face filter apps also treat them the same. FaceApp might filter someone’s face to make them look like they are smiling, or older, and not think twice about rolling out race-swapping filters.²
This is connected to a more general dehumanizing side effect of face analysis. Édouard Glissant writes that “understanding” someone is often based on making comparisons: either to an “ideal scale” or an analysis of differences without hierarchy. He calls this “transparency,” where each person is rendered “transparent” to the observer. The observer “sees through” the person to a system of classification and comparison. Édouard asks for “opacity”: to avoid reduction and understand each person as “an irreducible singularity.” But transparency is the essence of automated face analysis, and dehumanization is a natural consequence.
Dehumanization shows up in the terminology: just as prison guards refer to prisoners as "bodies" and cops refer to suspects in their surveillance systems as "objects," popular face analysis libraries call aligned face photos "face chips." The images no longer represent people, but are treated more like color chips.
Dehumanization also manifests in privacy practices. The most popular face filter apps have been caught harvesting images from users, like Meitu in 2017 or FaceApp in 2019. Using face-responsive interaction as bait for collecting images goes back nearly to the beginning of automated face analysis itself: in 1970 Toshiyuki Sakai collected hundreds of faces from Computer Physiognomy, an installation at the Osaka Expo that showed visitors which celebrity they looked like. When each face is seen as a replaceable and interchangeable variation on every other face, and large amounts of data are required to build and verify these systems, it makes perfect sense to surreptitiously build massive collections of faces. When the United States Customs and Border Protection loses 184,000 images of travelers, we may find out about it. But it is unclear whether any face filter apps have ever had a similar breach, or whether Meitu’s data has been turned over to the Chinese government.