When One Model Isn’t Enough: Building Multimodal Adult Content Detection for Mercari B2C

Abstract

Every B2C product listing on Mercari must be screened for adult content before it reaches buyers. We built a multimodal ML pipeline that analyses both product images and listing text to make these decisions at scale. This post describes the system’s design: a custom PyTorch fusion model that combines MobileNet V2 image embeddings with Japanese BERT text embeddings, run in parallel with a third-party API, with an OR gate combining their outputs. We cover our choice of fusion strategies (concatenation MLP vs. cross-attention), why we decided to run two independent classifiers rather than relying on one, and what we found when we attempted to improve performance by sub-classifying adult content into finer-grained categories. We also outline the offline evaluation framework used to validate these decisions.

Content Moderation on a B2C Marketplace

Managing the scale of Mercari’s B2C catalog involves screening millions of listings, where every entry consists of seller-uploaded imagery and unstructured Japanese text (titles, descriptions, and category metadata). Ensuring these listings adhere to our safety policies is paramount, with adult content detection representing one of our most critical moderation challenges.

The fundamental difficulty lies in the fact that “adult content” is a spectrum rather than a discrete binary class. On one end, we find unambiguous violations: explicit imagery or text that is clearly prohibited. At the other end, however, are edge cases where the content is benign in isolation but becomes problematic when combined. For instance, a piece of lingerie photographed on a mannequin is a standard product shot. The same garment, described with sexually suggestive language, shifts the listing into a different category. An art book containing classical nude paintings is a legitimate product. A cropped detail from one of those paintings, listed without context, raises different questions.

This nuance implies that relying on a single modality, whether image or text alone, is insufficient for robust moderation. The violation often resides in the latent relationship between what is visualised and what is written. A system that processes these signals independently will systematically overlook context-dependent violations (producing false negatives) or flag legitimate items where one modality provides necessary context (producing false positives).

This complexity serves as the core motivation for the multimodal architecture detailed in this post: a system engineered to evaluate imagery and text jointly, capturing the signal where it actually lives.

The Problem: Why Images Alone Are Not Enough

Whether a product listing contains adult content is usually obvious to a human reviewer, but manual review does not scale to the volume of Mercari’s B2C catalog. The natural starting point is an image classifier: train a CNN on labeled adult listing images and deploy it. In practice, however, image-only classification misses a critical dimension of the problem: context.

A DVD cover featuring explicit adult artwork will rightly trigger an image classifier. But the same cover image, listed with a title and description identifying it as a sealed, commercially released film in a standard product category, is a legitimate secondhand goods listing. Without the text, the image looks like a policy violation. With the text, it is a routine product shot. The label depends on the relationship between what the buyer sees and what the seller writes. Context lives in the text as much as the image, and a moderation system that ignores text will produce both false positives and false negatives that are difficult to resolve through threshold tuning alone.

This led us to frame the problem as multimodal classification: the model must consider both the image and the text together.

Multimodal Fusion Architecture

A naive approach to multimodal moderation would be to run separate image and text classifiers and flag if either fires. This captures no cross-modal interaction; that is, the model cannot learn that a particular image is benign in one textual context but problematic in another. We opted instead for genuine multimodal fusion, where image and text representations are combined before classification.

The architecture has two stages: encoding and fusion.

Turning Images and Text into Embeddings

For images, we use MobileNet V2 as a feature extractor. We strip the classification head and keep everything up to global average pooling, which gives us a 1280-dimensional embedding per image. The choice of MobileNet V2 over heavier architectures (ResNet-152, EfficientNet-B7) was a throughput decision. This model runs on every single listing, so if the encoder is slow, everything downstream is slow. MobileNet V2 gives us low-latency inference on CPU with an embedding that is rich enough for our fusion model to work with. We tried larger encoders during experimentation, and the accuracy gains were marginal compared to the latency cost.

When a listing has multiple images (and most do) we encode each one individually and average the embeddings. This is the simplest multi-image aggregation strategy, and it has an obvious weakness: a single violating image in a set of ten gets diluted by nine clean ones. We considered alternatives, including max-pooling and attention-weighted aggregation over individual image embeddings. In practice, averaging performed adequately. We may revisit this with a more expressive aggregation strategy in the future, but it has not been a bottleneck so far.
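The dilution effect is easy to see on a toy example. The sketch below compares mean aggregation with element-wise max-pooling on made-up 4-dimensional vectors; the function names and the data are illustrative assumptions, not values from our pipeline.

```python
import torch


def aggregate_mean(image_embeddings: torch.Tensor) -> torch.Tensor:
    """Average per-image embeddings of shape (num_images, dim)."""
    return image_embeddings.mean(dim=0)


def aggregate_max(image_embeddings: torch.Tensor) -> torch.Tensor:
    """Element-wise max over per-image embeddings of shape (num_images, dim)."""
    return image_embeddings.max(dim=0).values


# Toy listing: ten 4-d "embeddings"; only the first image activates feature 0.
listing = torch.zeros(10, 4)
listing[0, 0] = 1.0

mean_score = aggregate_mean(listing)[0].item()  # roughly 0.1: diluted by nine clean images
max_score = aggregate_max(listing)[0].item()    # 1.0: the single activation survives
print(mean_score, max_score)
```

Max-pooling preserves the strongest activation but also amplifies any single noisy image, which is one reason a plain average can remain a reasonable default.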

For text, we use a BERT model (loaded via HuggingFace Transformers) and take the [CLS] token’s hidden state as a 768-dimensional embedding. Because this is a Japanese marketplace, whitespace tokenisation is not applicable. The encoder uses fugashi and unidic-lite for morphological analysis, segmenting Japanese text into proper tokens before it reaches BERT. The model in our registry was pre-trained with this same tokenisation pipeline, so the segmentation is consistent between training and inference. A mismatch here, for example using a tokenizer designed for English whitespace splitting, would silently degrade embedding quality.
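The [CLS] extraction itself can be sketched as follows. To keep the example runnable offline, it uses a randomly initialised, deliberately shallow BERT and stand-in token IDs. The vocabulary size and layer count are assumptions, and a real pipeline would instead load the pretrained checkpoint together with its matching Japanese tokenizer.

```python
import torch
from transformers import BertConfig, BertModel

# Assumption: a small random BERT stands in for the pretrained Japanese model.
config = BertConfig(
    vocab_size=32000,
    hidden_size=768,
    num_hidden_layers=2,
    num_attention_heads=12,
    intermediate_size=3072,
)
bert = BertModel(config).eval()

# Stand-in token IDs for two listings of 16 tokens each. In practice these
# come from the tokenizer, which places [CLS] at position 0.
input_ids = torch.randint(5, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = bert(input_ids=input_ids, attention_mask=attention_mask)

# Hidden state at position 0 (the [CLS] slot) serves as the text embedding.
text_embeddings = outputs.last_hidden_state[:, 0]
print(text_embeddings.shape)  # torch.Size([2, 768])
```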

Fusing the Modalities

With a 1280-d image vector and a 768-d text vector in hand, the fusion model’s job is to combine them into a single adult-content probability.

We built the AdultFusionModel to support multiple fusion strategies, selectable at the checkpoint level.

Concatenation fusion is the simpler strategy and our first approach. We concatenate the image and text embeddings into a single 2048-dimensional vector and pass it through a two-layer MLP with GELU activations and layer normalisation, projecting down to a 1024-dimensional fused representation. This is classic late fusion. The MLP learns whatever cross-modal interactions exist in the data, but the two modalities are treated as peers; neither gets special attention. It trains fast, it is stable, and it works well when the two modalities contribute somewhat independently to the decision.
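A minimal sketch of this concatenation path is shown below. The dimensions follow the text (1280 + 768 = 2048 in, 1024 out). The class name `ConcatFusion`, the exact ordering of linear, normalisation, and activation layers, and the width of the intermediate layer are assumptions.

```python
import torch
from torch import nn


class ConcatFusion(nn.Module):
    """Concatenate image and text embeddings, then apply a two-layer MLP."""

    def __init__(self, image_dim: int = 1280, text_dim: int = 768,
                 fused_dim: int = 1024) -> None:
        super().__init__()
        in_dim = image_dim + text_dim  # 2048
        # Assumed layer ordering: Linear -> LayerNorm -> GELU, twice.
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, fused_dim),
            nn.LayerNorm(fused_dim),
            nn.GELU(),
            nn.Linear(fused_dim, fused_dim),
            nn.LayerNorm(fused_dim),
            nn.GELU(),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # (batch, 1280) and (batch, 768) -> (batch, 2048) -> (batch, 1024)
        return self.mlp(torch.cat([image_emb, text_emb], dim=-1))


fusion = ConcatFusion()
fused = fusion(torch.randn(4, 1280), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 1024])
```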

Cross-attention fusion is the more expressive alternative that we went ahead with. Both modality embeddings are projected into a shared hidden space, and bidirectional multi-head attention is applied. The image representation attends to the text, and the text representation attends to the image, each with residual connections. This allows the model to learn context-dependent representations, for instance that a particular image is benign in one textual context but problematic in another. In our experiments, cross-attention improved results on ambiguous edge cases where cross-modal context was important for the decision, and the gains were significant on the overall evaluation set. They came at the cost of slower training convergence and slightly higher inference latency.
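The bidirectional attention step can be sketched with `nn.MultiheadAttention`. Everything beyond what the paragraph states is an assumption: the class name, a 512-dimensional shared space, eight attention heads, the layer normalisation after each residual, and concatenating the two attended vectors to reach 1024 dimensions.

```python
import torch
from torch import nn


class CrossAttentionFusion(nn.Module):
    """Bidirectional cross-attention over one image token and one text token."""

    def __init__(self, image_dim: int = 1280, text_dim: int = 768,
                 hidden_dim: int = 512, attention_heads: int = 8) -> None:
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_to_text = nn.MultiheadAttention(hidden_dim, attention_heads, batch_first=True)
        self.text_to_image = nn.MultiheadAttention(hidden_dim, attention_heads, batch_first=True)
        self.image_norm = nn.LayerNorm(hidden_dim)
        self.text_norm = nn.LayerNorm(hidden_dim)

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Project into the shared space and add a sequence axis of length 1.
        img = self.image_proj(image_emb).unsqueeze(1)  # (batch, 1, hidden)
        txt = self.text_proj(text_emb).unsqueeze(1)    # (batch, 1, hidden)
        # Image queries attend to text keys/values, and vice versa.
        img_ctx, _ = self.image_to_text(query=img, key=txt, value=txt)
        txt_ctx, _ = self.text_to_image(query=txt, key=img, value=img)
        # Residual connections keep each modality's own signal.
        img = self.image_norm(img + img_ctx)
        txt = self.text_norm(txt + txt_ctx)
        return torch.cat([img, txt], dim=-1).squeeze(1)  # (batch, 2 * hidden)


fusion = CrossAttentionFusion()
fused = fusion(torch.randn(4, 1280), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 1024])
```

Note that with a single pooled vector per modality, each softmax runs over one key, so the attention weights are trivially 1 and the block reduces to a learned projection of the other modality plus a residual. Attending over image patch features or BERT token states would be one way to give the attention more to work with.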

After fusion, the 1024-d representation goes through a classification head, a small MLP ending in a sigmoid that outputs a probability between 0 and 1. If it exceeds a configurable threshold, the listing is flagged.
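A sketch of the head and the thresholding step, under stated assumptions: the hidden width of 256, the class and function names, and the default threshold of 0.5 are illustrative, not production values.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Small MLP mapping a fused representation to a probability in [0, 1]."""

    def __init__(self, fused_dim: int = 1024, hidden_dim: int = 256) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # Sigmoid squashes the single logit into a probability.
        return torch.sigmoid(self.net(fused)).squeeze(-1)  # (batch,)


def flag_listings(probabilities: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Flag a listing when its probability exceeds the threshold."""
    return probabilities > threshold


head = ClassificationHead().eval()
with torch.no_grad():
    probabilities = head(torch.randn(4, 1024))
flags = flag_listings(probabilities, threshold=0.5)
print(probabilities, flags)
```

Keeping the threshold outside the model means it can be tuned against the precision and recall trade-off without retraining.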

Multi-Label Subcategorisation: An Experiment That Did Not Improve Performance

Before settling on binary classification, we explored whether sub-classifying adult content into finer-grained categories would improve detection performance. The hypothesis was that a model trained to distinguish between subcategories (explicit nudity, suggestive content, partial nudity, and similar distinctions) would learn richer internal representations and produce better overall adult content detection than a model trained only on a binary label. As a secondary benefit, subcategory predictions would provide the moderation operations team with more actionable signal for prioritizing review queues.

The AdultFusionModel architecture supports num_heads > 1, so training multi-head variants was straightforward. We trained several configurations with varying numbers of subcategories, using both shared-fusion-independent-heads and fully independent head architectures.

The results did not support the hypothesis. On our evaluation set, the multi-label models performed roughly on par with the binary classifier on the aggregated detection metric (the “is this adult content at all?” question, obtained by OR-ing across subcategory heads). This was likely because the training signal was fragmented across categories that were difficult to annotate consistently. The subcategory predictions themselves exhibited high variance: inter-annotator agreement on distinctions like “suggestive” versus “partial nudity” was low, and this annotation noise propagated into the model.
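For concreteness, the shared-fusion, independent-heads variant and the OR aggregation can be sketched like this. The class name, the three subcategories, and the single shared threshold are assumptions made for the example.

```python
import torch
from torch import nn


class MultiHeadClassifier(nn.Module):
    """Independent sigmoid heads on top of a shared fused representation."""

    def __init__(self, fused_dim: int = 1024, num_heads: int = 3) -> None:
        super().__init__()
        # One linear head per subcategory; num_heads counts classification
        # heads here, not attention heads.
        self.heads = nn.ModuleList([nn.Linear(fused_dim, 1) for _ in range(num_heads)])

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        logits = torch.cat([head(fused) for head in self.heads], dim=-1)
        return torch.sigmoid(logits)  # (batch, num_heads)


def is_adult(head_probabilities: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """OR across subcategory heads: flag if any head exceeds the threshold."""
    return (head_probabilities > threshold).any(dim=-1)


model = MultiHeadClassifier().eval()
with torch.no_grad():
    head_probabilities = model(torch.randn(4, 1024))
print(is_adult(head_probabilities))

# Hand-written probabilities make the OR behaviour explicit.
toy = torch.tensor([[0.1, 0.2, 0.05], [0.1, 0.8, 0.3]])
print(is_adult(toy).tolist())  # [False, True]
```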

We reverted to binary classification. The conclusion is not that multi-label adult content detection is fundamentally infeasible, but that it requires higher annotation consistency than our current dataset provides. The operational value of subcategory labels does not yet justify the annotation investment required to support them. This remains an area we expect to revisit as the labeled dataset grows and annotation guidelines are refined.

Offline Evaluation

We evaluated our models on a manually labeled dataset of 405 products. The class distribution skews toward negatives, which is important to keep in mind when interpreting precision and recall.

Using this dataset, we compared three approaches: the third-party API alone, our in-house model alone, and a combined pipeline. The key trade-off we were watching was precision vs. recall.

The third-party API achieves the highest precision (92.42%) but catches only about half of the true positives, making it too conservative for our use case. Our in-house model substantially improves recall to 86.21% while maintaining a reasonable precision of 79.36%. The combined approach pushes recall even further to 87.93% with only a marginal drop in precision. Overall, the in-house model and combined pipeline both deliver an F1 score around 82.6%, a significant improvement over the third-party API’s 67.03%.
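These figures can be sanity-checked with the F1 definition, F1 = 2PR / (P + R). The short sketch below recomputes the in-house F1 from its reported precision and recall, and solves the same formula for the third-party API's recall, which the text only describes as "about half". Inputs are the rounded percentages quoted above, so the results are approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    return f1 * precision / (2 * precision - f1)


in_house_f1 = f1_score(precision=0.7936, recall=0.8621)
api_recall = recall_from_f1(precision=0.9242, f1=0.6703)

print(f"in-house F1 ~ {in_house_f1:.3f}")               # about 0.826
print(f"third-party recall (implied) ~ {api_recall:.3f}")  # about 0.526
```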

The Dual-Classifier Strategy

The fusion model is the core of the system, but it is not the only classifier. We also run a third-party API on every listing, in parallel, through a separate service. The two classifiers’ outputs are combined with an OR gate: if either one flags a listing, it is flagged.

This dual-path architecture has real costs: API pricing on every listing (which is zero at the moment), the complexity of maintaining two independent systems, and a higher false-positive rate from the OR aggregation. The justification comes from two factors.

First, the two classifiers have complementary coverage. Our fusion model is trained on Mercari-specific data, runs with low latency, and gives us full control over the threshold. The third-party API accepts images, text, or both, and is trained on broader data. It catches patterns our in-house model has not seen in training; conversely, our model identifies Mercari-specific violations that a general-purpose API misses. Running both and OR-ing the results gives us broader coverage than either path alone.

Second, the OR gate reflects a deliberate prioritisation of recall over precision. In content moderation, the cost of errors is asymmetric. A false positive sends a legitimate listing to human review, which is a recoverable inconvenience. A false negative allows adult content to reach buyers, with regulatory and reputational consequences that are difficult to undo. The OR gate accepts the precision cost in exchange for higher recall.

An additional benefit that became apparent in production is operational resilience. When the third-party API degrades, the in-house model continues processing without interruption. While the in-house model is being retrained, the API path maintains coverage. As long as one path is up, the moderation pipeline is never fully unavailable.
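The dual-path behaviour described in this section can be sketched as follows. All names here (`in_house_model`, `third_party_api`, the `listing` dictionary fields) are hypothetical stand-ins, and treating a failed path as "no verdict" rather than as a flag is an assumption of this sketch, not a statement about the production service.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Classifier = Callable[[dict], bool]


def safe_call(classifier: Classifier, listing: dict) -> Optional[bool]:
    """Run one classifier; return None if that path is unavailable."""
    try:
        return classifier(listing)
    except Exception:
        return None


def moderate(listing: dict, in_house: Classifier, third_party: Classifier) -> bool:
    """Run both paths in parallel and OR the verdicts that came back."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(safe_call, clf, listing) for clf in (in_house, third_party)]
        verdicts = [future.result() for future in futures]
    available = [verdict for verdict in verdicts if verdict is not None]
    if not available:
        raise RuntimeError("both moderation paths are unavailable")
    return any(available)  # OR gate: flagged if either path flags


# Hypothetical stand-ins for the two classifiers.
def in_house_model(listing: dict) -> bool:
    return listing["fusion_score"] > 0.5


def third_party_api(listing: dict) -> bool:
    if listing.get("api_down"):
        raise TimeoutError("third-party API degraded")
    return listing["api_flag"]


print(moderate({"fusion_score": 0.2, "api_flag": True}, in_house_model, third_party_api))   # True
print(moderate({"fusion_score": 0.9, "api_down": True}, in_house_model, third_party_api))   # True
print(moderate({"fusion_score": 0.2, "api_flag": False}, in_house_model, third_party_api))  # False
```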

What We Learned

Honestly, the biggest lesson was about humility. We went into this project assuming the hard part would be the model architecture: finding the right fusion strategy, the right embedding dimensions, the right attention mechanism. And those things mattered, but the decisions that had the most practical impact were less glamorous.

The multi-label experiment was humbling in a different way. It was a good idea on paper, and we had the infrastructure to support it, but the data was not ready. Annotation consistency is a prerequisite, not an afterthought, and we underestimated how hard it is to get multiple annotators to agree on fuzzy subcategories like “suggestive” versus “mildly explicit.” Binary classification is a question humans can answer more consistently, and that consistency flows through to model quality.

And the choice to average multi-image embeddings rather than using something fancier? It works. We keep meaning to try attention-weighted pooling over individual image embeddings, and we keep not needing to because the simple average, combined with threshold tuning, handles the dilution issue well enough. Not every component needs to be sophisticated. Sometimes you should save the complexity budget for the parts that actually need it.

What Comes Next

The multi-head architecture is ready for new violation categories whenever the training data catches up. Dangerous goods, prohibited items, and other policy violations could each get their own classification head, trained jointly with the adult-content head on shared fusion features. The infrastructure supports it today; the bottleneck is labeled data.

We are also keeping an eye on how quickly third-party moderation APIs are improving. The dual-model strategy is valuable now, but it adds operational cost and complexity. If the API path becomes reliable and accurate enough on its own for certain categories, simplifying to a single path would reduce complexity. We plan to re-evaluate this periodically rather than treating the current architecture as permanent.

The multi-label adult content detection question is not dead, either. It did not give the expected result this time because of annotation quality, not because the idea is wrong. As our labeled dataset grows and we invest in clearer annotation guidelines, it is worth revisiting. Subcategory labels would genuinely help the operations team prioritise review queues, and the modelling infrastructure is already there.

Appendix: Abbreviations

AUC-ROC – Area Under the Receiver Operating Characteristic Curve

B2C – Business-to-Consumer

BERT – Bidirectional Encoder Representations from Transformers

CNN – Convolutional Neural Network

F1 – harmonic mean of precision and recall (F-score at β = 1)

GELU – Gaussian error linear unit

MLP – Multi-Layer Perceptron

Co-Authors: Anand Karunan & Harshit Ajmani
