MoralCLIP

Contrastive alignment of vision-and-language representations with Moral Foundations Theory (MFT)

Ana Carolina Condez, Diogo Tavares, João Magalhães

NOVA School of Science and Technology (FCT NOVA), NOVA LINCS — Lisbon, Portugal

Multimodal · CLIP · Moral Foundations · Embedding Space

Moral Foundations Theory

Our model aligns multimodal representations across five fundamental moral dimensions, each with an opposing virtue/vice pair: Care vs. Harm, Fairness vs. Cheating, Loyalty vs. Betrayal, Respect vs. Subversion, and Sanctity vs. Degradation.
🎉 Accepted to ACM Multimedia 2025! This work will be presented in the Brave New Ideas track on October 31, 2025, in Dublin, Ireland.

Abstract (short)

MoralCLIP extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). By integrating visual and textual moral cues into a unified embedding space, the model aligns inputs by shared moral meaning—not only by semantic similarity—enabling morally-aware cross-modal retrieval and analysis.

See full abstract in the paper.
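
To make the retrieval setting concrete, the snippet below is a minimal sketch of cross-modal retrieval in a shared embedding space: candidate captions are ranked for an image by cosine similarity. The embeddings are random stand-ins; nothing here assumes the actual MoralCLIP encoders or checkpoints.

    # Illustrative retrieval sketch: rank candidate captions for an image by
    # cosine similarity in a shared embedding space. The tensors below stand
    # in for MoralCLIP encoder outputs.
    import torch
    import torch.nn.functional as F

    def rank_captions(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
        """image_emb: (d,) image embedding; text_embs: (n, d) caption embeddings."""
        image_emb = F.normalize(image_emb, dim=-1)    # unit length
        text_embs = F.normalize(text_embs, dim=-1)
        sims = text_embs @ image_emb                  # cosine similarities, shape (n,)
        return torch.argsort(sims, descending=True)   # best-matching captions first

    # Toy example with random stand-in embeddings (d = 512, 4 candidate captions)
    print(rank_captions(torch.randn(512), torch.randn(4, 512)))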

Highlights

  • Morally-grounded embeddings: A CLIP-style contrastive objective augmented with moral supervision (an illustrative loss sketch follows this list).
  • New multimodal moral dataset: ~15k image–text pairs with MFT-aligned multi-labels (via expert labels + augmentation).
  • Visual Moral Compass: A high-precision moral image classifier used to scale annotations and generate captions.
  • Improved moral understanding: Gains across unimodal and multimodal analyses of moral content.
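
To illustrate how a CLIP-style contrastive objective can be augmented with moral supervision, the sketch below adds a moral-label agreement term to the standard symmetric InfoNCE loss. The weighting factor, the MSE formulation, and the multi-hot label encoding are illustrative assumptions, not the exact objective used in the paper.

    # Illustrative sketch (not the paper's exact objective): CLIP-style InfoNCE
    # plus a term that encourages embedding similarity to track MFT label agreement.
    import torch
    import torch.nn.functional as F

    def moral_clip_loss(img_emb, txt_emb, moral_labels, temperature=0.07, lambda_moral=0.5):
        """img_emb, txt_emb: (B, d) embeddings; moral_labels: (B, 5) multi-hot MFT labels."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity logits
        targets = torch.arange(img_emb.size(0), device=img_emb.device)

        # Standard symmetric contrastive (InfoNCE) term.
        clip_loss = 0.5 * (F.cross_entropy(logits, targets) +
                           F.cross_entropy(logits.t(), targets))

        # Moral supervision term: cross-modal similarity should track the
        # cosine similarity of the multi-hot MFT label vectors.
        labels = F.normalize(moral_labels.float(), dim=-1)
        label_sim = labels @ labels.t()
        moral_loss = F.mse_loss(img_emb @ txt_emb.t(), label_sim)

        return clip_loss + lambda_moral * moral_loss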

Resources

  • Pairs: ≈15,000
  • Foundations: 5 (MFT)
  • Modalities: Image ↔ Text
  • Models: CLIP-Base, MoralCLIP-Augmented, SafeCLIP-Large

Planned Usage (preview)

Coming Soon
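
Until the release, the snippet below is only a hypothetical sketch of how a CLIP-style checkpoint is commonly loaded with Hugging Face transformers; the checkpoint identifier is a stand-in and the final MoralCLIP interface may differ.

    # Hypothetical usage sketch (placeholder checkpoint; the released API may differ).
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    checkpoint = "openai/clip-vit-base-patch32"  # swap in the MoralCLIP checkpoint once released
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.new("RGB", (224, 224))         # stand-in image
    texts = ["a volunteer caring for others", "an act of betrayal"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    print(outputs.logits_per_image)              # image-text similarity scores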

Dataset

The MoralCLIP dataset provides multi-label annotations for the five Moral Foundations (care, fairness, loyalty, authority, purity) across image–text pairs. It is designed for training and evaluating morally-aware multimodal models.
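
The exact annotation schema is not specified on this page, so the snippet below only sketches what a multi-label record could look like; all field names and values are hypothetical.

    # Hypothetical record layout for one image-text pair with multi-label MFT
    # annotations (field names and values are illustrative, not the released schema).
    example_record = {
        "image_path": "images/000123.jpg",
        "caption": "Volunteers hand out meals at a community shelter.",
        "moral_labels": {  # 1 = foundation present, 0 = absent
            "care": 1,
            "fairness": 1,
            "loyalty": 0,
            "authority": 0,
            "purity": 0,
        },
    }

    # Multi-hot vector in a fixed foundation order, e.g. for model training.
    FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "purity"]
    multi_hot = [example_record["moral_labels"][f] for f in FOUNDATIONS]
    print(multi_hot)  # [1, 1, 0, 0, 0]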

Citation

If you use MoralCLIP in your research, please cite:

@inproceedings{10.1145/3746027.3758166,
      author = {Condez, Ana Carolina and Tavares, Diogo and Magalh\~{a}es, Jo\~{a}o},
      title = {MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory},
      year = {2025},
      isbn = {9798400720352},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3746027.3758166},
      doi = {10.1145/3746027.3758166},
      booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
      pages = {12399–12408},
      numpages = {10},
      keywords = {ai, clip, ethics, mft, moral, moral foundations, moralclip},
      location = {Dublin, Ireland},
      series = {MM '25}
    }

Ethical Considerations

  • Morality is pluralistic and context-dependent; model outputs should be interpreted with care.
  • Training involved expert-labeled and augmented data; annotation biases and cultural variance may persist.

License & Acknowledgements

Code and models will be released under a permissive research license. Portions of the dataset leverage SMID (Crone et al., 2018) annotations; please consult original licenses for any third-party data.