MoralCLIP

Contrastive alignment of vision-and-language representations with Moral Foundations Theory (MFT)

Ana Carolina Condez, Diogo Tavares, João Magalhães

NOVA School of Science and Technology (FCT NOVA), NOVA LINCS — Lisbon, Portugal

Multimodal · CLIP · Moral Foundations · Embedding Space

Moral Foundations Theory

Our model aligns multimodal representations across five fundamental moral dimensions, each with an opposing virtue/vice pair: Care vs. Harm, Fairness vs. Cheating, Loyalty vs. Betrayal, Respect vs. Subversion, and Sanctity vs. Degradation.
🎉 Accepted to ACM Multimedia 2025! This work will be presented in the Brave New Ideas track on October 31, 2025, in Dublin, Ireland.

Abstract (short)

MoralCLIP extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). By integrating visual and textual moral cues into a unified embedding space, the model aligns inputs by shared moral meaning—not only by semantic similarity—enabling morally-aware cross-modal retrieval and analysis.

See full abstract in the paper.
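
To make the retrieval setting concrete, the snippet below is a minimal sketch of cross-modal retrieval in a shared embedding space: candidate captions are ranked for an image by cosine similarity. The embeddings are random stand-ins; nothing here assumes the actual MoralCLIP encoders or checkpoints.

    # Illustrative retrieval sketch: rank candidate captions for an image by
    # cosine similarity in a shared embedding space. The tensors below stand
    # in for MoralCLIP encoder outputs.
    import torch
    import torch.nn.functional as F

    def rank_captions(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
        """image_emb: (d,) image embedding; text_embs: (n, d) caption embeddings."""
        image_emb = F.normalize(image_emb, dim=-1)    # unit length
        text_embs = F.normalize(text_embs, dim=-1)
        sims = text_embs @ image_emb                  # cosine similarities, shape (n,)
        return torch.argsort(sims, descending=True)   # best-matching captions first

    # Toy example with random stand-in embeddings (d = 512, 4 candidate captions)
    print(rank_captions(torch.randn(512), torch.randn(4, 512)))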

Highlights

  • Morally-grounded embeddings: A CLIP-style contrastive objective augmented with moral supervision (an illustrative loss sketch follows this list).
  • New multimodal moral dataset: ~15k image–text pairs with MFT-aligned multi-labels (via expert labels + augmentation).
  • Visual Moral Compass: A high-precision moral image classifier used to scale annotations and generate captions.
  • Improved moral understanding: Gains across unimodal and multimodal analyses of moral content.
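
To illustrate how a CLIP-style contrastive objective can be augmented with moral supervision, the sketch below adds a moral-label agreement term to the standard symmetric InfoNCE loss. The weighting factor, the MSE formulation, and the multi-hot label encoding are illustrative assumptions, not the exact objective used in the paper.

    # Illustrative sketch (not the paper's exact objective): CLIP-style InfoNCE
    # plus a term that encourages embedding similarity to track MFT label agreement.
    import torch
    import torch.nn.functional as F

    def moral_clip_loss(img_emb, txt_emb, moral_labels, temperature=0.07, lambda_moral=0.5):
        """img_emb, txt_emb: (B, d) embeddings; moral_labels: (B, 5) multi-hot MFT labels."""
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity logits
        targets = torch.arange(img_emb.size(0), device=img_emb.device)

        # Standard symmetric contrastive (InfoNCE) term.
        clip_loss = 0.5 * (F.cross_entropy(logits, targets) +
                           F.cross_entropy(logits.t(), targets))

        # Moral supervision term: cross-modal similarity should track the
        # cosine similarity of the multi-hot MFT label vectors.
        labels = F.normalize(moral_labels.float(), dim=-1)
        label_sim = labels @ labels.t()
        moral_loss = F.mse_loss(img_emb @ txt_emb.t(), label_sim)

        return clip_loss + lambda_moral * moral_loss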

Resources

  • Pairs: ≈15,000
  • Foundations: 5 (MFT)
  • Modalities: Image ↔ Text
  • Models: CLIP-Base, MoralCLIP-Augmented, SafeCLIP-Large

Planned Usage (preview)

Coming Soon
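
Until the release, the snippet below is only a hypothetical sketch of how a CLIP-style checkpoint is commonly loaded with Hugging Face transformers; the checkpoint identifier is a stand-in and the final MoralCLIP interface may differ.

    # Hypothetical usage sketch (placeholder checkpoint; the released API may differ).
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    checkpoint = "openai/clip-vit-base-patch32"  # swap in the MoralCLIP checkpoint once released
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    image = Image.new("RGB", (224, 224))         # stand-in image
    texts = ["a volunteer caring for others", "an act of betrayal"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    print(outputs.logits_per_image)              # image-text similarity scores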

Dataset

The MoralCLIP dataset provides multi-label annotations for the five Moral Foundations (care, fairness, loyalty, authority, purity) across image–text pairs. It is designed for training and evaluating morally-aware multimodal models.
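
The exact annotation schema is not specified on this page, so the snippet below only sketches what a multi-label record could look like; all field names and values are hypothetical.

    # Hypothetical record layout for one image-text pair with multi-label MFT
    # annotations (field names and values are illustrative, not the released schema).
    example_record = {
        "image_path": "images/000123.jpg",
        "caption": "Volunteers hand out meals at a community shelter.",
        "moral_labels": {  # 1 = foundation present, 0 = absent
            "care": 1,
            "fairness": 1,
            "loyalty": 0,
            "authority": 0,
            "purity": 0,
        },
    }

    # Multi-hot vector in a fixed foundation order, e.g. for model training.
    FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "purity"]
    multi_hot = [example_record["moral_labels"][f] for f in FOUNDATIONS]
    print(multi_hot)  # [1, 1, 0, 0, 0]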

Citation

If you use MoralCLIP in your research, please cite:

@inproceedings{10.1145/3746027.3758166,
      author = {Condez, Ana Carolina and Tavares, Diogo and Magalh\~{a}es, Jo\~{a}o},
      title = {MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory},
      year = {2025},
      isbn = {9798400720352},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3746027.3758166},
      doi = {10.1145/3746027.3758166},
      booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
      pages = {12399–12408},
      numpages = {10},
      keywords = {ai, clip, ethics, mft, moral, moral foundations, moralclip},
      location = {Dublin, Ireland},
      series = {MM '25}
    }

Ethical Considerations

  • Morality is pluralistic and context-dependent; model outputs should be interpreted with care.
  • Training involved expert-labeled and augmented data; annotation biases and cultural variance may persist.

License & Acknowledgements

Code and models will be released under a permissive research license. Portions of the dataset leverage SMID (Crone et al., 2018) annotations; please consult original licenses for any third-party data.