1.1 — Combating Bias at Pinterest
Machine learning powers many advanced search and recommendation systems, and user experience strongly depends on how well these systems perform across all data segments. This performance can be impacted by biases, leading to a subpar experience for subsets of users, content providers, applications, or use cases. In her talk “Inclusive Search and Recommendations,” Nadia Fawaz describes sources of bias in machine learning technology, why addressing bias matters, and techniques to mitigate bias, with examples from her work on inclusive AI at Pinterest.
With 442 million global monthly active users, 240 billion pins saved, and 5 billion boards across 30 languages, Pinterest has an amazing dataset for search and recommendation tasks. The most basic task is to predict the likelihood that a pinner will interact with a pin, given the search query, pinner features, pin features, and the pinner’s past interactions with pins and boards. However, it is not always easy to surface the most relevant results. The majority of queries on Pinterest are fewer than three words, which presents an interesting serving challenge. Moreover, the current ranking algorithms are heavily influenced by what most people have engaged with over time, which means that some pinners have had to work harder to find what they were looking for. To build a more inclusive search experience, the R&D team at Pinterest defined inclusive AI’s key pillars, starting with analyzing bias at all development stages.
Here are the sources of bias in machine learning that Nadia described:
- Societal bias is inherently present in the data — due to many diversity dimensions such as demographic, geographic, cultural, application-specific, implicit, etc.
- Data collection bias entails serving bias, position bias, summary-based presentation bias, and repetitiveness bias.
- Modeling bias includes statistical under-fitting (models that are too simple, with too few parameters and insufficient features), model fairness (disparities in performance across groups), the training objective (an aggregate loss function may favor the majority class, make it hard to differentiate between error types, or focus on only a single utility), model structure (bias from model sub-components), and coarse model tuning (single or per-group thresholds are not robust enough).
- Offline evaluation bias happens with evaluation data (imbalanced classes, biased labeling, static features) and evaluation metrics (coarse overall aggregates, accuracy might favor the majority class and hide performance for under-represented classes, disparities are not evaluated).
- Experimentation bias occurs during A/B testing: your treatment doesn’t have the same effect on un-engaged users as it does on engaged users, but the engaged users are the ones who show up first and therefore dominate the early experimental results. If you trust the short-term results without accounting for and trying to mitigate this bias, you risk being trapped in the present: building a product for the users you have already activated instead of the users you want to activate in the future.
Pinterest wants to reduce bias in its machine learning applications because of societal and legal requirements, its user-centric mission, and its high standard of technical craftsmanship. Here are the techniques that Pinterest applies:
- Randomization at data collection: For the top-k recommendation task, they used an explore-exploit strategy. During exploitation, the model serves the items with the highest predicted scores; during exploration, it collects feedback on items with lower predicted scores by injecting randomly sampled items into the results (see the first sketch after this list).
- Diversity re-ranking at serving: They intentionally boosted pin scores for deeper skin tones. For models with multiple stages, they applied the boost at both the lightweight-ranking and full-ranking layers. This method is simple but requires manual tuning and careful coordination among multiple post-processors. Beyond boosting, they also tried fairness-aware re-ranking via greedy and dynamic-programming algorithms that self-adjust the recommendations to meet a target diversity distribution in the top-k results (see the second sketch after this list).
- Data augmentation: At the data collection stage, they generated synthetic data, performed negative sampling, and ensured diverse manual data labeling. At the modeling stage, they used techniques like SMOTE (synthetic minority over-sampling) and IPS (inverse propensity scoring) for resampling (see the third sketch after this list), revised the model via error analysis, architectural changes, and ensembling methods, and augmented the training objective with fairness and diversity constraints.
- Offline evaluation: During the evaluation phase, they ensured that the test dataset has good enough coverage. They looked at objective function beyond aggregates (by incorporating fairness notions to quantify disparities) and evaluation metrics beyond accuracy (by doing error analysis for each class label). Finally, they also experimented with open-source monitoring tools, brought humans into the loop, measured models in production, and re-evaluated their performance over time.
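To make the explore-exploit idea concrete, here is a toy epsilon-greedy slate builder under my own assumptions (the talk does not specify Pinterest's exact randomization scheme):

```python
import random

def epsilon_greedy_slate(candidates, scores, k, epsilon=0.05):
    """Fill a top-k slate, mostly exploiting high-scoring items but
    occasionally exploring lower-scored ones to collect unbiased feedback.

    candidates: list of item ids
    scores: dict mapping item id -> predicted engagement score
    """
    pool = sorted(candidates, key=lambda c: scores[c], reverse=True)
    slate = []
    while pool and len(slate) < k:
        if random.random() < epsilon:
            choice = random.choice(pool)   # explore: random (likely lower-ranked) item
        else:
            choice = pool[0]               # exploit: best remaining item
        slate.append(choice)
        pool.remove(choice)
    return slate

# Example: a 10-item slate drawn from 100 candidates with synthetic scores
items = [f"pin_{i}" for i in range(100)]
scores = {item: random.random() for item in items}
print(epsilon_greedy_slate(items, scores, k=10))
```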
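The greedy fairness-aware re-ranking could be sketched as follows: fill the slate by favoring whichever group is furthest below its target share, breaking ties by score. This is my simplification of the algorithms Nadia referenced, not Pinterest's production code:

```python
def diversity_rerank(items, k, target):
    """Greedy re-ranking toward a target group distribution in the top-k.

    items: list of (item_id, group, score) triples
    target: dict mapping group -> desired fraction of the top-k slate
    """
    slate, counts = [], {g: 0 for g in target}
    remaining = list(items)
    while remaining and len(slate) < k:
        # deficit = how far each group is below its target share so far
        deficits = {g: target[g] * (len(slate) + 1) - counts[g] for g in target}
        # among items of the most-deficient group, take the best-scored one
        remaining.sort(key=lambda x: (-deficits[x[1]], -x[2]))
        item_id, group, _ = remaining.pop(0)
        slate.append(item_id)
        counts[group] += 1
    return slate

items = [("a", "g1", 0.9), ("b", "g1", 0.8), ("c", "g2", 0.7),
         ("d", "g2", 0.6), ("e", "g1", 0.5), ("f", "g2", 0.4)]
print(diversity_rerank(items, k=4, target={"g1": 0.5, "g2": 0.5}))
```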
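And the SMOTE resampling mentioned under data augmentation can be demonstrated with the imbalanced-learn library (a generic illustration; Pinterest's pipeline is not public):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 95% majority, 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points by interpolating between neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```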
A specific example that Nadia brought up of how Pinterest mitigates bias is their skin-tone models (check out this post for more details). These are closed-box deep learning models that detect faces, extract the dominant color, and threshold it into lightness ranges. Such utilities give users control, respect their privacy, improve the experience for users with deeper skin tones, and increase engagement with diverse content. During the offline evaluation, the Pinterest team designed various strategies to quantify bias and error patterns: labeling a high-quality, diverse golden dataset, using a confusion matrix to analyze error patterns, and choosing granular metrics per sensitive attribute plus fairness metrics to quantify disparities. After several iterations of the model, they also augmented the data (by including body parts, partial faces, and men’s fashion) and created multi-task visual embeddings (fashion classification, beauty detection, internationalization), which contribute to the final skin-tone classification task. Here are specific takeaways that Nadia mentioned:
- Start with diverse data.
- Bias can come from modeling choices and the evaluation process.
- Test the machine learning system again and again, at every step.
- Quantify bias and analyze error patterns at granular attribute level for fast iteration.
- Learn from errors, and make your Machine Learning models learn too.
- Build iteratively to uncover complexity layers: first get the simple case right, then master production scale, and finally expand to harder cases and more product surfaces.
- Overall accuracy is not synonymous with fairness. It’s crucial to proactively manage biases and improve all metrics for all skin tones.
Nadia ended her talk with concluding remarks about the benefits of Pinterest’s Inclusive AI approaches. Firstly, they improve user representation and content provider exposure. Secondly, they help Pinterest understand and increase content diversity. Thirdly, they mitigate bias in machine learning models for embedding, retrieval, and ranking tasks. Finally, they enable Pinterest to grow an inclusive product globally. Notably, inclusive AI is a challenge that goes beyond engineering: it requires contributions from multi-disciplinary teams (product, inclusion and diversity, legal and ethics, community and societal feedback) to effectively model, measure, and address AI bias.
1.2 — Design Patterns for Recommendations at Twitch and Twitter
Building recommendation systems in production that can serve millions of customers goes way beyond having a great algorithm. The scale of users, the size of the catalog, and the speed of reaction to user actions make such systems very challenging to build; a set of cooperating systems is needed to serve users well. In his talk “Key Design Patterns For Building Recommendation Systems At Scale,” Ashish Bansal distilled learnings from building large-scale recommendation systems at companies like Twitch and Twitter into a set of commonly used design patterns.
He started with a few motivating examples:
- In a system like Twitter, there are a variety of recommendation use cases: other users to follow (billions of items with a months-to-years shelf life), relevant tweets (hundreds of millions of items with an hours-long shelf life), and events/trends (hundreds of thousands of items with an hours-long shelf life).
- In a system like Netflix, hundreds of thousands of movies are recommended with years-to-decades shelf life.
- In a system like Amazon, hundreds of millions of products are recommended with months-to-years shelf life.
If we classify recommendation systems by item volume and velocity, four system patterns stand out (as seen above):
- Few-Short Pattern: Few items with a short shelf life. These systems capture real-time features and serve real-time inference; session-based recommendations may be useful.
- Few-Long Pattern: Few items with a long shelf life. This is the best spot to be in. Common approaches include end-to-end deep learning and matrix factorization to capture long-lived item and user embeddings, batch pre-computation of the similarity scores, cache-based serving, etc.
- Big-Short Pattern: A tough space. Real-time features make a huge difference, so common approaches include two-stage architectures (candidate generation + blender/ranker) and approximate nearest neighbor algorithms.
- Big-Long Pattern: Here, the complexity lies in managing large models and large user/item data. All the techniques from the few-long pattern will work in this case.
Ashish then illustrated the two-stage architectural pattern with this nice diagram above:
- In stage 1, the candidate generation layer quickly filters millions of items down to hundreds; this step must finish in a few milliseconds. Tools like Elasticsearch work well here, with item-to-item similarities pre-computed using metrics like cosine distance.
- In stage 2, the blending layer combines candidates from different sources, scores the candidates based on a utility function, and ranks them based on the scores. The blender could be rule-based, could be a model estimator, or could take in query parameters from the users.
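A minimal end-to-end sketch of this two-stage pattern, assuming dense item embeddings and a hypothetical freshness signal (brute-force similarity stands in for a real ANN index such as FAISS or Elasticsearch):

```python
import numpy as np

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(100_000, 64)).astype(np.float32)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)
freshness = rng.random(100_000).astype(np.float32)   # hypothetical signal

def generate_candidates(user_vec, n=500):
    """Stage 1: cheap similarity search over the whole catalog.
    In production an approximate-nearest-neighbor index replaces this."""
    user_vec = user_vec / np.linalg.norm(user_vec)
    sims = item_vecs @ user_vec                      # cosine similarity
    top = np.argpartition(-sims, n)[:n]
    return top, sims[top]

def blend_and_rank(cand_ids, sims, k=50):
    """Stage 2: blend candidate scores with a utility function and rank.
    The 0.8/0.2 weights are arbitrary, for illustration only."""
    utility = 0.8 * sims + 0.2 * freshness[cand_ids]
    return cand_ids[np.argsort(-utility)[:k]]

user = rng.normal(size=64).astype(np.float32)
cands, sims = generate_candidates(user)
print(blend_and_rank(cands, sims)[:10])
```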
He next discussed the three typical recommendation types:
- User-User modeling involves measuring the similarity between users (useful for cold-start issues). Neighborhood-based methods based on user attributes would suffice.
- User-Item modeling is the most common pattern. We use matrix factorization, factorization machines, or deep models to capture the user-item interactions directly (a minimal sketch follows this list).
- Item-Item modeling entails association rules and market-basket analysis. Neighborhood-based methods based on item attributes would suffice.
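For the user-item case, a bare-bones matrix factorization trained with SGD might look like this (a didactic sketch, not what Twitter runs in production):

```python
import numpy as np

def factorize(ratings, n_users, n_items, dim=16, lr=0.01, reg=0.05, epochs=20):
    """Plain SGD matrix factorization for user-item modeling.
    ratings: list of (user_idx, item_idx, rating) triples."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, dim))
    V = rng.normal(scale=0.1, size=(n_items, dim))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]            # prediction error on this pair
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Toy interactions: (user, item, rating)
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U, V = factorize(data, n_users=3, n_items=3)
print("predicted score for user 0, item 2:", U[0] @ V[2])
```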
Ashish concluded the talk with a couple of supporting system patterns that can help with recommendations at scale:
- Impression store can track served recommendations. It is important to decide beforehand what constitutes an impression (a toy sketch follows this list).
- Feedback store can capture user feedback on the relevance of recommendations. It can be used as a filter for irrelevant items.
- Explicit interests store allows users to guide recommendations. It can be challenging to incorporate such a mechanism into the model.
- Session tracker tracks impressions to clicks and actions. It can be challenging due to the distance between impressions and other actions (replies, for example).
- Label generator logs training data for future models in production. However, there is often a large imbalance between positive and negative labels, which may require re-weighting data samples.
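As a toy illustration of the impression store (my hypothetical sketch, with "rendered on screen" chosen as the impression definition):

```python
import time
from collections import deque

class ImpressionStore:
    """Tracks recently served recommendations per user so they are not
    repeated; here an impression means the item was rendered on screen."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.log = {}  # user_id -> deque of (item_id, timestamp)

    def record(self, user_id, item_ids):
        q = self.log.setdefault(user_id, deque())
        now = time.time()
        q.extend((item, now) for item in item_ids)

    def filter_seen(self, user_id, candidates):
        now = time.time()
        q = self.log.get(user_id, deque())
        while q and now - q[0][1] > self.ttl:   # expire old impressions
            q.popleft()
        seen = {item for item, _ in q}
        return [c for c in candidates if c not in seen]

store = ImpressionStore()
store.record("u1", ["tweet_1", "tweet_2"])
print(store.filter_seen("u1", ["tweet_1", "tweet_3"]))  # -> ['tweet_3']
```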
1.3 — Moderating Comments at The Washington Post
Patrick Cullen and Ling Jiang gave an informative talk about raising the quality of online conversations with machine learning at The Washington Post (WP). In particular, they shared how they built a system that automatically moderates millions of reader comments.
At WP, comments provide a way for journalists to speak directly to readers and build a sense of community, as readers share their views on important topics. The commenters are often the most active and engaged readers. However, trolls, bots, and incivility lower the quality of online conversations, so moderating them is a logical next step. WP receives more than 2 million comments a month, so relying on human moderators alone is cost-prohibitive. ModBot is an application that combines machine learning with human moderators to moderate conversation quality at a scale of millions of comments.
As depicted in the diagram above, the ModBot API takes as input the comment and outputs a number between 0 and 1. Scores close to 0 indicate that the comment should be approved, and scores close to 1 indicate that the comment should be deleted from the site because it violates WP’s community guidelines.
ModBot includes a pre-filter, a rules-based system that flags comments containing banned words. If the pre-filter passes the input, the machine learning classifier scores the comment, and the API call returns the score and moderation decision.
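Here is a minimal sketch of that two-stage flow. The banned-word list, the toy training data, and the TF-IDF-plus-SVM stand-in classifier are my assumptions; the talk only discloses that production ModBot uses an SVM:

```python
import re

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

BANNED = {"slur1", "slur2"}    # placeholder banned-word list

def prefilter(comment: str):
    """Rules-based first stage: flag comments containing banned words."""
    tokens = set(re.findall(r"[a-z']+", comment.lower()))
    return 1.0 if tokens & BANNED else None

# Stand-in classifier: TF-IDF + calibrated linear SVM on toy data
train_x = ["great reporting, thank you"] * 20 + ["spam spam buy now"] * 20
train_y = [0] * 20 + [1] * 20
clf = make_pipeline(TfidfVectorizer(),
                    CalibratedClassifierCV(LinearSVC())).fit(train_x, train_y)

def modbot_score(comment: str) -> float:
    """0 ~ approve, 1 ~ delete, per the API contract described above."""
    rule = prefilter(comment)
    if rule is not None:
        return rule
    return clf.predict_proba([comment])[0, 1]

print(modbot_score("spam spam buy now"))
```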
To train ModBot, the data science team at WP built a classifier that learns from training data using NLP techniques. They collected over 60,000 human-labeled comments. As noted above, deleted comments often contain many offensive words, hyperlinks, and special symbols, while approved comments use neutral or positive words and have substantial length. Once trained, ModBot can differentiate good comments from bad ones and score new, unseen comments.
They ran several different models using bag-of-words features with 10-fold cross-validation on an imbalanced dataset (70% approved and 30% deleted). These models include Logistic Regression, Support Vector Machines, Random Forest, Decision Trees, and Naive Bayes. After initial experiments, they found that Logistic Regression and Support Vector Machines performed better than the others.
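A comparison along those lines can be reproduced with scikit-learn (toy duplicated comments here; the real experiments used the 60,000 labeled comments):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy corpus mimicking the 70% approved / 30% deleted imbalance
comments = ["thanks for the thoughtful reporting"] * 30 + \
           ["this is spam spam spam with links"] * 15
labels = [0] * 30 + [1] * 15

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(),
    "decision_tree": DecisionTreeClassifier(),
    "naive_bayes": MultinomialNB(),
}
for name, model in models.items():
    pipe = make_pipeline(CountVectorizer(), model)  # bag-of-words features
    scores = cross_val_score(pipe, comments, labels, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```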
Then, they engineered more features that can be predictive, including sentence count, word count, link count, email count, and special character count. They also experimented with Convolutional Neural Networks and Recurrent Neural Networks, which both outperformed the linear models but are expensive and hard to interpret. Eventually, they settled on a Support Vector Machine model in production due to its simplicity and explainability.
Another fascinating aspect of this application is the “human-in-the-loop” component, both during training and during inference. In the comment above, ModBot suggested deleting it due to the word “idiot.” However, the human moderator approved the comment because the post allows criticism of public officials. To handle this common issue, the data science team added a named entity filtering layer to pre-process the comments that involve public figures.
In general, handling mislabeled training data requires an iterative process that includes the comments, ModBot API, the predictions, and the human reviewer. After ModBot predicts on comments, the human reviewer can modify the label. The revised data can be fed back into the training process to retrain the model for better accuracy.
In production, ModBot uses thresholds for automatic moderation. As seen in the slide, anything above 0.8 is automatically deleted and anything below 0.2 is automatically approved. Human reviewers can step in if a reader flags a comment, and the system gives these reviewers the flexibility to adjust the thresholds. Note that comments are not evenly distributed along this score range due to the dataset’s imbalanced nature. Because there is a tradeoff between the amount of automatic moderation and accuracy, the data science team needs to work with stakeholders to set these thresholds in production.
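In code, the routing logic is as simple as the following (sending the middle band to humans is my reading of the slide):

```python
def route(score: float, delete_at: float = 0.8, approve_at: float = 0.2) -> str:
    """Map a ModBot score to a moderation decision."""
    if score >= delete_at:
        return "auto-delete"
    if score <= approve_at:
        return "auto-approve"
    return "hold for human review"   # assumption: mid-band goes to humans

for s in (0.05, 0.5, 0.93):
    print(s, "->", route(s))
```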
1.4 — Harnessing The Power of NLP at The Vector Institute
Developing and deploying Natural Language Processing (NLP) models has become progressively more challenging as model complexity increases, datasets grow in size, and computational requirements rise. These hurdles limit many organizations’ access to NLP capabilities, putting the significant benefits of advanced NLP out of reach. Sedef Kocak gave a talk about a collaborative project at The Vector Institute that explores how state-of-the-art NLP models can be applied at scale in business and industry settings.
For context, the Vector Institute drives excellence and leadership in Canada’s knowledge, creation, and use of AI to foster economic growth and improve Canadians’ lives. It was created by visionary scientists and entrepreneurs who have lived the challenges of creating commercial AI technologies. The institute has 500+ researchers with domain-leading expertise in all areas of machine learning, 900+ participants in its programs and courses, and 1000+ MS students enrolled in its academic offerings. It aims to bring industry and researchers together to apply state-of-the-art solutions to specific industry problems.
Current NLP solutions require massive infrastructure/computation and trained human resources. The goal of Vector’s NLP project was retraining deep language models at scale to significantly reduce the cost of training NLP models while increasing the accessibility and benefits for businesses and researchers. The project has three focus areas:
1 — Domain-Specific Training: The three domains they focus on are health, finance, and legal.
- For the health domain, they pre-trained language representations in the biomedical domain by replicating BioBERT and fine-tuning it on Named Entity Recognition, Relation Extraction, and Question Answering tasks (a loading sketch follows these bullets). They also conducted an experimental evaluation of Transformer-based language models in the biomedical domain to answer questions like (1) Does domain-specific training improve performance compared to baseline models trained on general corpora? and (2) Is it possible to obtain comparable results from a domain-specific BERT model pre-trained on smaller-sized data?
- For the finance domain, they investigated use cases of Transformer-based language models in finance text. In particular, they came up with finance-specific training for BERT, created a finance training corpus that covers versatile styles and sources of finance text, and proposed a semi-automated strategy of generating fine-tuning tasks on any domain.
- For the legal domain, they looked at tokenization and weight initialization approaches to adapt a contextualized language model to the legal domain.
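As an illustration of working with such a domain-specific checkpoint, the public BioBERT release can be loaded for token classification with Hugging Face Transformers (the label set and this exact workflow are illustrative assumptions, not Vector's published code):

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Public BioBERT checkpoint from the original authors
checkpoint = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=3   # e.g., B-Disease / I-Disease / O tags for NER
)

tokens = tokenizer("BRCA1 mutations increase breast cancer risk.",
                   return_tensors="pt")
logits = model(**tokens).logits     # shape: (1, seq_len, num_labels)
print(logits.shape)
```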
2 — Pre-Training Large Models: This work addressed the practical limitations of pre-training BERT. They presented optimizations for improving single-device training throughput, distributing the training workload over multiple nodes and GPUs, and overcoming the communication bottleneck introduced by large data exchanges over the network.
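Vector's exact stack is not described, but the generic PyTorch recipe for the multi-GPU part looks like this (a sketch assuming a `train.py` entry point launched with `torchrun`):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap a model for multi-GPU data-parallel training.
    Launch with: torchrun --nproc_per_node=<num_gpus> train.py"""
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # DDP overlaps gradient all-reduce with the backward pass, which helps
    # hide the network communication bottleneck the talk mentions
    return DDP(model, device_ids=[local_rank])
```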
3 — Summarization, Question Answering, and Machine Translation: Finally, they also had other initiatives related to different NLP tasks, including (1) developing domain-specific text summarization datasets, (2) exploring masked sequence-to-sequence multi-node unsupervised machine translation, and (3) building question-answering systems in responding to the COVID-19 open research dataset challenge.
Sedef concluded the talk with a couple of key takeaways:
- Large language models are challenging — due to their “black-box” nature, dataset size, hyper-parameter sensitivity, and computational resources.
- Domain knowledge improves the performance of NLP tasks in those domains.
- Small-sized datasets can be useful for model retraining.
- Domain-specific pre-training could improve fine-tuning tasks.
- Collaboration between different subject matter experts is a tough organizational challenge — due to time commitment, participant turnover, and knowledge localization.
- Best practices to organize this sort of large-scale collaboration are (1) getting quick wins, (2) monitoring group progress, and (3) trying out experimental learning.
1.5 — Generating Synthetic Data at Arima
A synthetic dataset is a data object generated programmatically. It is often necessary in situations where data privacy is a concern or where collecting data is difficult or costly. Although synthetic data generation is a fundamental step for many data science tasks, an efficient and standard framework for it is absent. In his talk “A Machine Learning-Based Privacy-Preserving Framework For Generating Synthetic Data From Aggregated Sources,” Winston Li from Arima studied a specific synthetic data generation task called downscaling, a procedure to infer high-resolution information from low-resolution variables, and proposed a multi-stage framework.
Here is a quick primer about the synthetic population:
- It is a statistical reconstruction of individual-level data where the ground truth is not available.
- It is built from available sources, which are generally aggregated geographically.
- It is statistically equivalent to a real population from a data science perspective.
Generally speaking, synthetic data is useful when we need to work with multiple data sources: the data is fragmented (each source has its own structure), privacy is required (data must be aggregated or anonymized), and there is no obvious way to link the datasets (missing data). How can we fuse these sources into a more consistent, accurate, and useful dataset?
To fill this gap, in collaboration with other academic labs, the Arima team proposed a multi-stage framework called SynC (Synthetic Population via Gaussian Copula):
- SynC first removes potential outliers in the data.
- Then, SynC fits the filtered data with a Gaussian copula model to correctly capture the dependencies and marginal distributions of the sampled survey data (a sketch of the copula step follows this list).
- Finally, SynC leverages predictive models to merge datasets into one and then scales them accordingly to match the marginal constraints.
- There are two key assumptions at play: (1) Correlation is preserved under aggregation, and (2) Synthetic data needs to be calibrated to match the given aggregated data.
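To unpack the copula step: a Gaussian copula separates each variable's marginal distribution from the dependency structure, which is captured by a single correlation matrix in normal-score space. Here is a minimal fit-and-sample sketch on toy data (not SynC's actual implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy "survey" with two correlated, non-Gaussian marginals
income = rng.lognormal(mean=10, sigma=0.5, size=5000)
age = np.clip(income / 1000 + rng.normal(45, 10, size=5000), 18, 90)
data = np.column_stack([income, age])
n, d = data.shape

# 1. Map each marginal to normal scores via its empirical CDF
ranks = stats.rankdata(data, axis=0) / (n + 1)
z = stats.norm.ppf(ranks)

# 2. The Gaussian copula is characterized by the correlation of the scores
corr = np.corrcoef(z, rowvar=False)

# 3. Sample new normal scores and map back through empirical quantiles
z_new = rng.multivariate_normal(np.zeros(d), corr, size=5000)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(data[:, j], u_new[:, j]) for j in range(d)]
)
print(np.corrcoef(synthetic, rowvar=False))   # dependencies are preserved
```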
The main data sources that Arima works with are open data (census, StatCan research), syndicated research (psychographics, healthcare, media usage, financials, leisure), and partnership data (credit card transactions, credit ratings, location tracking). In the end, their synthetic population is a national, individual-level dataset with 4000+ variables based on all Canadians. No personally identifiable information (PII) was used to respect privacy laws.
The Arima team also created a synthetic population matching API, which takes away the tedious work around data acquisition and cleaning. Data scientists can, as a result, focus on building the most robust machine learning models.
1.6 — Understanding Content at Tubi
Jaya Kawale gave a talk about how Tubi, an ad-supported video-on-demand service that allows its users to watch content online, uses Natural Language Processing (NLP) to understand its content. With more than 33 million monthly active users and 30 thousand titles, Tubi aims to use machine learning to understand user preferences, improve video recommendations, influence buying decisions, address cold-start behavior, and categorize video content.
For a lot of the content, there is a large amount of textual data in user reviews, synopses, title plots, and even Wikipedia. Furthermore, there is a large amount of metadata: actors, ratings, year of release, studio, etc. To make sense of it all, the team at Tubi applied various NLP methods, ranging from simple (continuous bag-of-words, skip-gram, word2vec, doc2vec) to complex (BERT, knowledge graphs). Here are the lessons that Jaya shared:
- Not all text is the same (e.g., reviews vs. subtitles vs. synopsis).
- Different tasks require different texts (e.g., sentiment analysis vs. text summarization).
- Averaging is a widely used method but can lead to information loss (e.g., multiple reviews for a title averaged together to generate a title embedding; see the sketch after this list).
- Be careful with the choice of algorithms (e.g., BERT is more suitable for next sentence prediction). There is “no free lunch” in terms of algorithms and representations.
- Pre-processing and clean-up are critical.
- Evaluation is hard but critical (e.g., embedding quality assessment on surrogate tasks).
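To illustrate the averaging caveat, here is a toy version of turning review embeddings into a title embedding (hypothetical title ids and random vectors standing in for real doc2vec or BERT outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Pretend each review already has an embedding (e.g., from doc2vec or BERT)
review_embs = {
    "tt_001": rng.normal(size=(120, dim)),   # 120 reviews for one title
    "tt_002": rng.normal(size=(3, dim)),     # a cold-start title with 3 reviews
}

def title_embedding(title_id):
    """Average the title's review embeddings into a single vector.
    Cheap and effective, but opposing opinions cancel out: the mean of
    'hilarious' reviews and 'boring' reviews says neither."""
    return review_embs[title_id].mean(axis=0)

print(title_embedding("tt_001").shape)   # -> (64,)
```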
In the end, they built Spock, a platform for data ingestion, pre-processing, and cleaning. It generates a variety of embeddings for different use cases across the product, along with surrogate tasks to assess the quality of those embeddings. As seen above:
- The inputs include first and third-party data, viewer-oriented data, and other content metadata.
- All this information goes into Spock, where it is cleaned and pre-processed into products. These products can take the form of embeddings, models, or “beams” from the content universe to the Tubiverse.
- Several use cases can consume these content understanding products, such as addressing cold-start behavior, assessing content value, analyzing portfolios, setting up pricing tiers, augmenting search, seeding growth, pursuing new audiences, etc.
Jaya ended the talk with three concrete future directions for Spock: (1) improve natural language understanding to better construct embeddings, (2) handle different languages for new geographical regions, and (3) unify embeddings across different use cases.
1.7 — Addressing Cold-Start Issues at Tractable
Despite the remarkable results achieved by deep neural networks in recent years, they are data-hungry, and their performance relies heavily on the quality and size of the training data. In real-world scenarios, this can significantly increase time-to-value for businesses, since collecting huge amounts of labeled data is time-consuming and costly. This phenomenon — known as the cold start problem — is a pain point for almost any company that wants to scale its machine learning applications. In their talk “Overcoming The Cold Start Problem — How To Make New Tasks Tractable,” Azin Asgarian from Georgian and Franziska Kirschner from Tractable demonstrated how this problem can be addressed by aggregating data across sources and leveraging previously trained models.
Tractable is a UK-based AI company that uses computer vision to speed up accident and disaster recovery. Here’s how their product works for the vehicle damage use case:
- The vehicle owner uploads an estimate and pictures of the damaged vehicle to their claims management system.
- Tractable’s AI, which is trained on millions of real images of car accidents and repair operations, then compares the pictures and the estimate to accurately judge the repair operations.
- The vehicle owner receives an assessment, flagging any potential inaccuracies.
Operating in 13 countries across three continents, Tractable’s AI deals with cars worldwide, which look different and are repaired differently even for the same damage. As a result, a model that classifies car damage well in one country will perform poorly in new geographies due to shifts in the data. The Tractable team partnered with the Georgian team to adapt to these unique data shifts quickly and efficiently and overcome the cold-start problem. The proposed method aims to improve performance for one customer (the target) by exploiting the enormous amount of data and models available from other customers (the source).
In particular, the cold-start problem is usually caused by three types of data shifts: input shift, output shift, and conditional shift. To address input shift, they rely on two types of transfer learning methods.
- Instance-based transfer learning methods re-weight the source domain samples to correct for marginal distribution differences; the re-weighted instances are then used directly for training in the target domain. Using the re-weighted source samples helps the target learner use only the source domain’s relevant information (a re-weighting sketch follows this list).
- Feature-based transfer learning methods map the feature spaces from both source and target domains into lower-dimensional spaces while minimizing the distances between similar samples in both domains.
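A common instance-based recipe, which the talk did not spell out, is density-ratio re-weighting with a domain classifier: train a model to distinguish source from target samples, then weight each source sample by the odds that it came from the target domain. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_source = rng.normal(loc=0.0, size=(2000, 5))   # abundant source data
X_target = rng.normal(loc=0.7, size=(200, 5))    # scarce, shifted target data

# Train a classifier to tell source (0) from target (1)
X = np.vstack([X_source, X_target])
d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
clf = LogisticRegression(max_iter=1000).fit(X, d)

# Density-ratio weights: p(target|x) / p(source|x), up to a constant
p = clf.predict_proba(X_source)
weights = p[:, 1] / p[:, 0]
# Pass `weights` as sample_weight when fitting the target-domain model
print(weights.min(), weights.max())
```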
To address output and conditional shifts (which happen more frequently in the real world), they use parameter-based transfer learning methods, which transfer knowledge through the shared parameters of the source and target domain learner models. The key idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. They also experimented with ensemble learning, which concatenates various pre-trained models for various tasks into a strong model for a new task.
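The parameter-based idea is what everyday fine-tuning does. A minimal PyTorch sketch, with a public ImageNet backbone standing in for a source-domain model and a hypothetical four-class damage head:

```python
import torch
import torchvision

# Start from a source-domain model (a public ImageNet backbone here, as a
# stand-in for a model trained on another customer's data)
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Freeze the shared structure learned on the source domain
for p in model.parameters():
    p.requires_grad = False

# Replace and train only the task head on the small target dataset
model.fc = torch.nn.Linear(model.fc.in_features, 4)  # e.g., 4 damage classes
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```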
Tractable’s AI builds a visual damage assessment module using the metadata from car images, which creates an abstract, domain-independent representation of the damage. This representation contains all the information necessary for the domain adaptation task. Having access to it, they built a domain-adaptable layer that adapts the repair methodology for each geography. The business outcomes of starting warm were highly encouraging: quicker and more efficient expansion to new markets, reduced data collection and labeling costs, faster customer onboarding, earlier revenue recognition, and an elevated brand.