May 1, 2026
When One Model Isn’t Enough: Building Multimodal Adult Content Detection for Mercari B2C
Abstract
Every B2C product listing on Mercari Japan must be screened for adult content before it reaches buyers. We built a multimodal ML pipeline that analyses both product images and listing text to make these decisions at scale. This post describes the system’s design: a custom PyTorch fusion model that combines MobileNet V2 image embeddings with Japanese BERT text embeddings, run in parallel with a third-party API, with an OR gate combining their outputs. We cover our choice of fusion strategy (concatenation MLP vs. cross-attention), why we decided to run two independent classifiers rather than rely on one, and what we found when we tried to improve performance by sub-classifying adult content into finer-grained categories. We also outline the offline evaluation framework used to validate these decisions.
Content Moderation on a B2C Marketplace
Operating Mercari Japan’s B2C catalog means screening millions of listings, each consisting of seller-uploaded images and unstructured Japanese text (title, description, and category metadata). Ensuring these listings adhere to our safety policies is paramount, and adult content detection is one of our most critical moderation challenges.
The fundamental difficulty is that “adult content” is a spectrum rather than a binary class. At one end are unambiguous violations: explicit imagery or text that is clearly prohibited. At the other end are edge cases where each part of a listing is benign in isolation but the combination is problematic. For instance, lingerie photographed on a mannequin is a standard product shot, but the same garment described with sexually suggestive language shifts the listing into a different category. Likewise, an art book containing classical nude paintings is a legitimate product, while a cropped detail from one of those paintings, listed without context, raises different questions.
This nuance implies that relying on a single…
Continue reading: https://about.in.mercari.com/news/blog/when-one-model-isnt-enough-building-multimodal-adult-content-detection-for-mercari-b2c/
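The abstract names the pipeline’s components but not their internals. The following is a minimal PyTorch sketch under stated assumptions, not Mercari’s implementation. It illustrates the two fusion strategies the post compares (a concatenation MLP and cross-attention) and an OR gate over two independent verdicts. The class names, `or_gate`, the embedding sizes (1280 for torchvision’s MobileNet V2 output, 768 for a BERT-base Japanese model), the hidden sizes, and the 0.5 threshold are all hypothetical. The third-party API is reduced to one boolean flag per listing.

```python
import torch
import torch.nn as nn

# Assumed embedding sizes (not stated in the post): torchvision's MobileNet V2
# ends in 1280 channels; BERT-base models use a 768-dim hidden state.
IMG_DIM, TXT_DIM = 1280, 768


class ConcatFusionHead(nn.Module):
    """Hypothetical concatenation-MLP head: one pooled vector per modality."""

    def __init__(self, hidden: int = 256, dropout: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),  # one logit: adult vs. not adult
        )

    def forward(self, img_vec: torch.Tensor, txt_vec: torch.Tensor) -> torch.Tensor:
        # img_vec: (B, 1280) pooled image features; txt_vec: (B, 768), e.g. [CLS]
        return self.mlp(torch.cat([img_vec, txt_vec], dim=-1)).squeeze(-1)


class CrossAttentionFusionHead(nn.Module):
    """Hypothetical cross-attention head: text tokens attend over image regions."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.img_proj = nn.Linear(IMG_DIM, d_model)
        self.txt_proj = nn.Linear(TXT_DIM, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, R, 1280), e.g. R = 7 * 7 cells of the final MobileNet V2
        # feature map for a 224x224 input; txt_tokens: (B, T, 768) BERT token states.
        query = self.txt_proj(txt_tokens)
        key_value = self.img_proj(img_tokens)
        attended, _ = self.attn(query, key_value, key_value)  # (B, T, d_model)
        # Mean-pool over text positions; assumes unpadded text, for brevity.
        return self.classifier(attended.mean(dim=1)).squeeze(-1)


def or_gate(fusion_logits: torch.Tensor, api_flags: torch.Tensor,
            threshold: float = 0.5) -> torch.Tensor:
    """Flag a listing if EITHER the in-house model or the external API flags it."""
    return (torch.sigmoid(fusion_logits) >= threshold) | api_flags


# Usage with random stand-in embeddings for a batch of 4 listings.
torch.manual_seed(0)
concat_head = ConcatFusionHead().eval()
xattn_head = CrossAttentionFusionHead().eval()
with torch.no_grad():
    concat_logits = concat_head(torch.randn(4, IMG_DIM), torch.randn(4, TXT_DIM))
    xattn_logits = xattn_head(torch.randn(4, 49, IMG_DIM), torch.randn(4, 32, TXT_DIM))
api_flags = torch.tensor([False, True, False, False])  # stand-in API verdicts
decisions = or_gate(concat_logits, api_flags)
print(concat_logits.shape, xattn_logits.shape, decisions)
```

The OR gate trades precision for recall: a listing is held if either classifier flags it, so a violation slips through only when both miss it. Concatenation needs just one pooled vector per modality and is cheap to serve. Cross-attention lets individual text tokens weigh specific image regions, which may suit the “benign alone, problematic together” cases better, at a higher compute cost.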



