Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Overview
Process Overview: A total of 191 cat point-of-view (POV) videos are sourced from the internet. Our biologically informed cat-vision filter is applied to individual frames, creating pairs of original (human) versus cat-vision-filtered frames that are passed through a suite of frozen vision encoders. The extracted features are then subjected to statistical tests, including CKA, RSA, and distributional analysis.
Abstract
Cats and humans differ in ocular anatomy. Most notably, Felis catus (the domestic cat) has vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the strongest alignment (mean CKA-RBF ≈0.814, mean CKA-linear ≈0.745, mean RSA ≈0.698), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA ≈0.53 at block8; ViT-L/16 ≈0.47 at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but fall below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge.
Methodology
We study cross-species representational invariance by comparing layer-wise activations from diverse vision encoders on paired views of the same scenes captured in human and feline domains. All encoders are kept frozen to isolate the inductive biases of the pretrained features, enabling apples-to-apples comparisons across architectures and training paradigms.
Dataset
We curate a paired image dataset to enable controlled cross-species comparisons of the same scenes. Point-of-view (POV) videos recorded by cameras strapped to the necks of domestic cats are collected from the internet, temporally aligned, and decomposed into frames; images are then paired at the filename level to ensure one-to-one correspondences. The curated dataset comprises 191 videos and over 300,000 frame pairs.
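A minimal sketch of the filename-level pairing step, assuming a hypothetical layout in which each video's extracted frames are written to `original/` and `cat_filtered/` subdirectories with matching filenames; the directory names and file extension here are illustrative, not the project's actual layout:

```python
from pathlib import Path

def pair_frames(root):
    """Pair human (original) and cat-filtered frames by shared filename."""
    root = Path(root)
    original = {p.name: p for p in (root / "original").glob("*.png")}
    filtered = {p.name: p for p in (root / "cat_filtered").glob("*.png")}
    # Keep only filenames present in both domains so every pair is one-to-one.
    shared = sorted(original.keys() & filtered.keys())
    return [(original[name], filtered[name]) for name in shared]

pairs = pair_frames("data/frames")  # hypothetical root directory
print(f"{len(pairs)} human/cat frame pairs")
```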
Biologically Informed Cat-Vision Filter
To probe whether architectural invariances persist under species-specific optics and early vision, we apply a biologically informed image transformation that approximates key feline visual characteristics. The transformation models: (i) spectral sensitivity with rod dominance and reduced long-wavelength sensitivity; (ii) spatial acuity and peripheral falloff; (iii) extended field-of-view distortions; (iv) temporal sensitivity and elevated flicker fusion; (v) motion sensitivity with horizontal bias; (vi) vertical-slit pupil optics; and (vii) tapetum lucidum low-light enhancement.
Figure: Biologically informed cat-vision filter components including photoreceptor spectral sensitivity, temporal frequency response, vertical slit pupil kernel, and spectral/visual acuity map.
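For illustration only, a heavily simplified stand-in for a few of the components above: a spectral remix toward blue/green (reduced long-wavelength sensitivity), a Gaussian blur as a crude acuity proxy, and a mild low-light gain. The coefficients are assumptions chosen for readability rather than the paper's calibrated parameters, and the pupil, field-of-view, temporal, and motion components are omitted:

```python
import numpy as np
import cv2

def approximate_cat_filter(img_bgr):
    """Very simplified approximation of a cat-vision transform (illustrative only)."""
    img = img_bgr.astype(np.float32) / 255.0
    b, g, r = cv2.split(img)
    # Shift long-wavelength (red) energy into green/blue to mimic dichromatic sensitivity.
    b2 = 0.85 * b + 0.15 * g
    g2 = 0.70 * g + 0.30 * r
    r2 = 0.40 * r + 0.60 * g
    out = cv2.merge([b2, g2, r2])
    out = cv2.GaussianBlur(out, (0, 0), sigmaX=1.5)   # rough proxy for reduced spatial acuity
    out = np.clip(out * 1.15, 0.0, 1.0)               # mild gain as a low-light stand-in
    return (out * 255).astype(np.uint8)
```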
Models and Feature Extraction
All encoders are used with their canonical pretrained weights and are not fine-tuned. We evaluate three families: CNNs (ResNet, DenseNet, EfficientNet, ConvNeXt, MobileNet), supervised transformers (ViT-B/L, Swin-T/S/B), and self-supervised transformers (DINO, DINOv2/v3). For convolutional encoders, we probe semantically comparable stages spanning early, middle, and late processing. For transformer encoders, we probe block-wise token representations.
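A minimal sketch of frozen, stage-wise feature extraction using a torchvision ResNet-50 with forward hooks; the study probes many more encoders, and the global-average pooling of spatial maps shown here is an assumption for illustration. `images` stands in for a batch of paired PIL frames:

```python
import torch
import torchvision.models as tvm
from torchvision.models import ResNet50_Weights

# Frozen ResNet-50 probed at early/middle/late stages via forward hooks.
weights = ResNet50_Weights.IMAGENET1K_V2
model = tvm.resnet50(weights=weights).eval()
for p in model.parameters():
    p.requires_grad_(False)

features = {}

def tap(name):
    def hook(_module, _inp, out):
        # Global-average-pool spatial maps to a (batch, channels) feature vector.
        features[name] = out.mean(dim=(2, 3)).detach()
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(tap(name))

preprocess = weights.transforms()
with torch.no_grad():
    batch = torch.stack([preprocess(img) for img in images])  # `images`: paired PIL frames
    model(batch)
# `features` now holds layer-wise activations for one domain; repeat for the other domain.
```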
Alignment Metrics
We quantify alignment using complementary geometric and statistical tools; minimal sketches of the core computations are given below:
- Centered Kernel Alignment (CKA): Both linear and RBF variants to measure representational similarity
- Representational Similarity Analysis (RSA): Cosine-based dissimilarity matrices with Mantel permutation testing
- Distribution shift tests: Maximum Mean Discrepancy (MMD), Energy distance, and projected 1-Wasserstein distance
- Paired similarity: Per-pair cosine similarity and Euclidean distance with statistical testing
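Minimal NumPy/SciPy sketches of linear/RBF CKA, RSA over cosine-distance RDMs, and a Mantel-style permutation test, assuming each layer's features are flattened to an (n_samples, dim) matrix; the RBF bandwidth heuristic is an assumption rather than the paper's exact choice:

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist, squareform

def center_gram(K):
    """Double-center a Gram matrix (HSIC-style centering)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(X, Y, kernel="linear", sigma=None):
    """CKA between feature matrices X, Y of shape (n_samples, dim_x) and (n_samples, dim_y)."""
    if kernel == "linear":
        Kx, Ky = X @ X.T, Y @ Y.T
    else:
        def rbf(Z):
            d2 = squareform(pdist(Z, "sqeuclidean"))
            # Median-distance bandwidth heuristic (an assumption, not the paper's setting).
            s = sigma if sigma is not None else np.sqrt(np.median(d2[d2 > 0]) / 2)
            return np.exp(-d2 / (2 * s ** 2))
        Kx, Ky = rbf(X), rbf(Y)
    Kx, Ky = center_gram(Kx), center_gram(Ky)
    hsic = (Kx * Ky).sum()
    return hsic / (np.linalg.norm(Kx) * np.linalg.norm(Ky))

def rsa_spearman(X, Y):
    """RSA: Spearman correlation between cosine-distance RDMs."""
    return spearmanr(pdist(X, "cosine"), pdist(Y, "cosine")).correlation

def mantel_pvalue(X, Y, n_perm=1000, seed=0):
    """Mantel test: permute one RDM's rows/columns and recompute the RSA statistic."""
    rng = np.random.default_rng(seed)
    rdm_x = squareform(pdist(X, "cosine"))
    rdm_y = squareform(pdist(Y, "cosine"))
    obs = spearmanr(squareform(rdm_x), squareform(rdm_y)).correlation
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(rdm_x.shape[0])
        null.append(spearmanr(squareform(rdm_x[perm][:, perm]), squareform(rdm_y)).correlation)
    return (1 + np.sum(np.asarray(null) >= obs)) / (1 + n_perm)
```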
All p-values are corrected using the Benjamini-Hochberg procedure (FDR level 0.05) to control false discoveries across the large grid of model-layer-metric combinations.
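A minimal sketch of the Benjamini-Hochberg step-up correction applied to the pooled p-values; equivalent functionality exists in standard statistics libraries, and this hand-rolled version is shown only to make the procedure explicit:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure: returns a rejection mask and adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # BH critical values: (i / m) * alpha for ranks i = 1..m.
    thresh = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    # Adjusted p-values: monotone (step-up) minimum of m * p_(i) / i.
    adj = np.minimum.accumulate((ranked * m / np.arange(1, m + 1))[::-1])[::-1]
    p_adj = np.empty(m)
    p_adj[order] = np.clip(adj, 0, 1)
    return reject, p_adj
```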
Figure: t-SNE and UMAP embeddings showing domain overlap patterns across CNN model families.
Figure: t-SNE and UMAP embeddings showing domain overlap patterns across Transformer model families.
Figure: t-SNE and UMAP embeddings showing domain overlap patterns across DINO model families.
Results
We evaluate cross-species representational alignment on frozen encoders across three families: CNNs (ResNet, MobileNet, DenseNet, EfficientNet, ConvNeXt), supervised transformers (ViT, Swin), and self-supervised transformers (DINO, DINOv2/v3). We report layer-wise alignment aggregated per model via CKA (linear and RBF), RSA/Mantel, and distribution-shift and paired metrics.
Overall Performance
Across families, the self-supervised Vision Transformer DINO ViT-B/16 achieves the highest mean RBF-CKA (0.8144), closely followed by supervised ViT-L/16 (0.8057). Among CNNs, EfficientNet-B3 yields the strongest mean RBF-CKA (0.7017). These results indicate that transformer-based encoders, particularly self-supervised ViTs, preserve stronger cross-species invariances under our paired design.
| Family | Model | Best Layer/Block | CKA-RBF (mean) | CKA-Linear (mean) | RSA (mean) | Mean Cosine |
|---|---|---|---|---|---|---|
| CNN | EfficientNet-B3 | stage5 | 0.7017 | 0.6371 | 0.5344 | 0.6308 |
| CNN | ResNet-50 | layer3 | 0.6902 | 0.6628 | 0.4876 | 0.6022 |
| CNN | DenseNet-169 | db3 | 0.6853 | 0.6166 | 0.5417 | 0.7036 |
| CNN | EfficientNet-B1 | stage5 | 0.6838 | 0.6389 | 0.5107 | 0.4939 |
| CNN | ConvNeXt-L | stage1 | 0.5599 | 0.5355 | 0.5428 | 0.8292 |
| Transformer (sup.) | ViT-L/16 | block14 | 0.8057 | 0.7050 | 0.4647 | 0.5960 |
| Transformer (sup.) | ViT-B/16 | block8 | 0.7755 | 0.6840 | 0.5266 | 0.6943 |
| Transformer (sup.) | Swin-B | stage3 | 0.4688 | 0.4269 | 0.3818 | 0.6110 |
| Self-sup. (DINO) | DINO ViT-B/16 | block0 | 0.8144 | 0.7446 | 0.6980 | 0.7995 |
| Self-sup. (DINO) | DINO ViT-S/16 | block0 | 0.7682 | 0.6991 | 0.6668 | 0.8384 |
| Self-sup. (DINOv2) | DINOv2-Base | block0 | 0.7232 | 0.6082 | 0.5669 | 0.8454 |
| Overall Best | DINO ViT-B/16 | block0 | 0.8144 | 0.7446 | 0.6980 | 0.7995 |
Family-Specific Observations
1. CNNs: EfficientNet variants perform strongly, with B3 (CKA-RBF mean 0.7017) leading, followed closely by ResNet-50 (0.6902) and DenseNet-169 (0.6853). Best alignment typically occurs at later blocks (e.g., EfficientNet-B3 stage5; ResNet-50 layer3), consistent with hierarchical convergence.
2. Supervised Transformers: ViT-L/16 (0.8057) outperforms Swin variants by a wide margin; best alignment arises at deeper transformer blocks (block14 for ViT-L/16; block8 for ViT-B/16).
3. Self-Supervised Transformers: DINO ViT-B/16 achieves the highest overall alignment (0.8144); DINOv2-Base is strong (0.7232), while the DINOv3 pretrained variants show moderate alignment in our setting.
Figure: DINO models cluster in the upper-right region (high performance on both metrics), with DINO-ViT-B/16 achieving the highest RSA Spearman performance (0.698). CNN models form a tight cluster in the middle region, while Transformer models show high variability.
Layer-wise Dissimilarity Analysis
Beyond alignment, we systematically localize the layers with the strongest dissimilarity signals, i.e., low CKA/RSA combined with a high projected 1-Wasserstein distance (a sketch of this projected distance is given at the end of this subsection). Three robust patterns emerge across families:
- Lowest alignment concentrates early: Initial CNN convolutions (e.g., ResNet/conv1) and the earliest blocks in ViT/DINO variants tend to have the lowest CKA-Linear and RSA
- Distributional shift can peak late: Deeper EfficientNet stages and late ViT blocks exhibit the largest Wasserstein shifts while still retaining moderate alignment
- Self-supervised giants can decouple geometry and distribution: DINOv3 large models show very high late-block Wasserstein despite competitive CKA/RSA
Mantel permutation tests, MMD, and Energy distance frequently reject the null across layers, confirming measurable distributional differences even when alignment is high. Nevertheless, the leading models maintain robust cross-domain alignment by CKA and RSA, suggesting shape/semantic consistency despite domain shifts.
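A minimal sketch of a projected (sliced-style) 1-Wasserstein distance, averaging 1D Wasserstein distances over random unit projections; the number of projections and the averaging convention are assumptions and may differ from the exact projection scheme used in the paper:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def projected_wasserstein(X, Y, n_proj=128, seed=0):
    """Average 1D Wasserstein distance between X and Y along random unit directions.
    X, Y: feature matrices of shape (n_samples, dim) from the two domains."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    dists = []
    for _ in range(n_proj):
        v = rng.normal(size=dim)
        v /= np.linalg.norm(v)
        dists.append(wasserstein_distance(X @ v, Y @ v))
    return float(np.mean(dists))
```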
Additional Visualizations
CNN family embeddings with t-SNE and UMAP projections showing domain separation patterns.
Supervised Transformer embeddings revealing stronger cross-domain overlap compared to CNNs.
Self-supervised DINO model embeddings demonstrating the highest degree of human-cat representational alignment.
BibTeX
@misc{shah2025purrturbedstablehumancatinvariant,
title={Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs},
author={Arya Shah and Vaibhav Tripathi},
year={2025},
eprint={2511.02404},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.02404},
}