
Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs

Indian Institute of Technology Gandhinagar
Academic Research 2025

A unified benchmark quantifying feline-human cross-species representational alignment across CNNs, Vision Transformers, and self-supervised ViTs using frozen encoders and layer-wise analysis.

Overview

Process Overview Flow Diagram

Process Overview: A total of 191 cat point-of-view (POV) videos are sourced from the internet. Our biologically informed cat-vision filter is applied to individual frames, creating pairs of original (human-vision) and cat-vision-filtered frames that are passed through a suite of frozen vision encoders. The extracted features are then subjected to statistical tests including CKA, RSA, and distributional analyses.

Abstract

Cats and humans differ in ocular anatomy. Most notably, Felis catus (the domestic cat) has vertically elongated pupils linked to ambush predation; yet how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the strongest alignment (mean CKA-RBF ≈0.814, mean CKA-linear ≈0.745, mean RSA ≈0.698), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA ≈0.53 at block8; ViT-L/16 ≈0.47 at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but fall below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. These results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge.

Methodology

We study cross-species representational invariance by comparing layer-wise activations from diverse vision encoders on paired views of the same scenes captured in human and feline domains. All encoders are kept frozen to isolate the inductive biases of the pretrained features, enabling apples-to-apples comparisons across architectures and training paradigms.

Dataset

We curate a paired image dataset to enable controlled cross-species comparisons on the same scenes. Point-of-view (POV) videos of domestic cats, recorded with neck-mounted cameras, are collected from the internet, temporally aligned, and decomposed into frames; frames are then paired at the filename level to ensure one-to-one correspondence. Our curated dataset consists of 191 videos, yielding over 300,000 frame pairs.
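As an illustration of the filename-level pairing step, a minimal sketch is given below; the directory names and image extension are hypothetical choices for this example only, not the dataset's actual layout.

from pathlib import Path

def pair_frames(original_dir: str, filtered_dir: str):
    """Pair original and cat-vision-filtered frames by shared filename stem."""
    originals = {p.stem: p for p in Path(original_dir).glob("*.png")}
    filtered = {p.stem: p for p in Path(filtered_dir).glob("*.png")}
    common = sorted(originals.keys() & filtered.keys())
    return [(originals[k], filtered[k]) for k in common]

# Hypothetical directory layout; adjust to wherever the frames are stored.
pairs = pair_frames("frames/original", "frames/cat_filtered")
print(f"{len(pairs)} one-to-one frame pairs")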

Biologically Informed Cat-Vision Filter

To probe whether architectural invariances persist under species-specific optics and early vision, we apply a biologically informed image transformation that approximates key feline visual characteristics. The transformation models: (i) spectral sensitivity with rod dominance and reduced long-wavelength sensitivity; (ii) spatial acuity and peripheral falloff; (iii) extended field-of-view distortions; (iv) temporal sensitivity and elevated flicker fusion; (v) motion sensitivity with horizontal bias; (vi) vertical-slit pupil optics; and (vii) tapetum lucidum low-light enhancement.
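The complete filter is specified in the paper; as a rough sketch of the first two components only (rod-dominated spectral sensitivity with reduced long-wavelength response, and lower spatial acuity), one could write something like the following, where the channel weights and blur radius are illustrative assumptions rather than the paper's calibrated values.

import numpy as np
from scipy.ndimage import gaussian_filter

def approximate_cat_vision(rgb: np.ndarray, blur_sigma: float = 2.0) -> np.ndarray:
    """Rough approximation of two filter components: (i) rod-dominated,
    red-suppressed spectral sensitivity and (ii) reduced spatial acuity.
    `rgb` is an H x W x 3 float array in [0, 1]; all weights are illustrative."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Rod-dominated luminance with reduced long-wavelength (red) contribution.
    luminance = 0.10 * r + 0.60 * g + 0.30 * b
    # Pull each channel toward the rod signal, suppressing red most strongly.
    out = np.stack([0.3 * r + 0.7 * luminance,
                    0.8 * g + 0.2 * luminance,
                    0.8 * b + 0.2 * luminance], axis=-1)
    # Lower spatial acuity approximated by a spatial Gaussian blur (no blur across channels).
    out = gaussian_filter(out, sigma=(blur_sigma, blur_sigma, 0))
    return np.clip(out, 0.0, 1.0)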

Biologically Informed Cat Vision Filter

Figure: Biologically informed cat-vision filter components including photoreceptor spectral sensitivity, temporal frequency response, vertical slit pupil kernel, and spectral/visual acuity map.

Models and Feature Extraction

All encoders are used with their canonical pretrained weights and are not fine-tuned. We evaluate three families: CNNs (ResNet, DenseNet, EfficientNet, ConvNeXt, MobileNet), supervised transformers (ViT-B/L, Swin-T/S/B), and self-supervised transformers (DINO, DINOv2/v3). For convolutional encoders, we probe semantically comparable stages spanning early, middle, and late processing. For transformer encoders, we probe block-wise token representations.
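A minimal sketch of block-wise extraction from a frozen transformer encoder is shown below, using torch and timm; the model name, hook placement, and mean-pooled token readout are illustrative choices and may differ from the exact probing points used in the paper.

import torch
import timm

@torch.no_grad()
def extract_block_features(images: torch.Tensor,
                           model_name: str = "vit_base_patch16_224.dino"):
    """Return mean-pooled token features from every block of a frozen ViT.
    `images` is an (N, 3, 224, 224) tensor normalized for the chosen model."""
    model = timm.create_model(model_name, pretrained=True).eval()
    for p in model.parameters():
        p.requires_grad_(False)

    features, hooks = {}, []
    for i, block in enumerate(model.blocks):
        hooks.append(block.register_forward_hook(
            lambda _m, _inp, out, idx=i: features.update({f"block{idx}": out.detach()})))
    model(images)
    for h in hooks:
        h.remove()
    # One simple readout: average over tokens to get a single vector per image.
    return {name: tokens.mean(dim=1) for name, tokens in features.items()}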

Alignment Metrics

We quantify alignment using complementary geometric and statistical tools:

  • Centered Kernel Alignment (CKA): Both linear and RBF variants to measure representational similarity
  • Representational Similarity Analysis (RSA): Cosine-based dissimilarity matrices with Mantel permutation testing
  • Distribution shift tests: Maximum Mean Discrepancy (MMD), Energy distance, and projected 1-Wasserstein distance
  • Paired similarity: Per-pair cosine similarity and Euclidean distance with statistical testing

All p-values are corrected using the Benjamini-Hochberg procedure (FDR level 0.05) to control false discoveries across the large grid of model-layer-metric combinations.
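For reference, the sketch below gives minimal NumPy/SciPy implementations of linear CKA, RBF CKA with a median-heuristic bandwidth, a cosine-RDM RSA score, and Benjamini-Hochberg selection; these follow the standard formulations and are not the paper's exact code.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def _center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(X, Y, kernel="linear", sigma=None):
    """CKA between paired feature matrices X (n x d1) and Y (n x d2)."""
    if kernel == "linear":
        Kx, Ky = X @ X.T, Y @ Y.T
    else:  # RBF kernel; bandwidth defaults to the median-distance heuristic
        def rbf(A):
            d2 = squareform(pdist(A, "sqeuclidean"))
            s = sigma if sigma is not None else np.sqrt(np.median(d2[d2 > 0]) / 2)
            return np.exp(-d2 / (2 * s ** 2))
        Kx, Ky = rbf(X), rbf(Y)
    Kx, Ky = _center(Kx), _center(Ky)
    return float((Kx * Ky).sum() / (np.linalg.norm(Kx) * np.linalg.norm(Ky)))

def rsa_spearman(X, Y):
    """Spearman correlation between cosine-dissimilarity RDMs of X and Y."""
    return float(spearmanr(pdist(X, "cosine"), pdist(Y, "cosine")).correlation)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, p.size + 1) / p.size
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(p.size, dtype=bool)
    reject[order[:k]] = True
    return reject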

Embeddings Visualization

Figure: t-SNE and UMAP embeddings showing domain overlap patterns across CNN model families.

Embeddings Visualization

Figure: t-SNE and UMAP embeddings showing domain overlap patterns across Transformer model families.

Embeddings Visualization

Figure: t-SNE and UMAP embeddings showing domain overlap patterns across DINO model families.

Results

We evaluate cross-species representational alignment on frozen encoders across three families: CNNs (ResNet, MobileNet, DenseNet, EfficientNet, ConvNeXt), supervised transformers (ViT, Swin), and self-supervised transformers (DINO, DINOv2/v3). We report layer-wise alignment aggregated per model via CKA (linear and RBF), RSA/Mantel, and distribution-shift and paired metrics.

Overall Performance

Across families, the self-supervised Vision Transformer DINO ViT-B/16 achieves the highest mean RBF-CKA (0.8144), closely followed by supervised ViT-L/16 (0.8057). Among CNNs, EfficientNet-B3 yields the strongest mean RBF-CKA (0.7017). These results indicate that transformer-based encoders, particularly self-supervised ViTs, preserve stronger cross-species invariances under our paired design.

Table 1: Model-level aggregates from overall summaries. Rows are sorted by mean CKA-RBF within each family; the overall best model is repeated in the final row.

Family              Model            Best Layer/Block  CKA-RBF (mean)  CKA-Linear (mean)  RSA (mean)  Mean Cosine
CNN                 EfficientNet-B3  stage5            0.7017          0.6371             0.5344      0.6308
CNN                 ResNet-50        layer3            0.6902          0.6628             0.4876      0.6022
CNN                 DenseNet-169     db3               0.6853          0.6166             0.5417      0.7036
CNN                 EfficientNet-B1  stage5            0.6838          0.6389             0.5107      0.4939
CNN                 ConvNeXt-L       stage1            0.5599          0.5355             0.5428      0.8292
Transformer (sup.)  ViT-L/16         block14           0.8057          0.7050             0.4647      0.5960
Transformer (sup.)  ViT-B/16         block8            0.7755          0.6840             0.5266      0.6943
Transformer (sup.)  Swin-B           stage3            0.4688          0.4269             0.3818      0.6110
Self-sup. (DINO)    DINO ViT-B/16    block0            0.8144          0.7446             0.6980      0.7995
Self-sup. (DINO)    DINO ViT-S/16    block0            0.7682          0.6991             0.6668      0.8384
Self-sup. (DINOv2)  DINOv2-Base      block0            0.7232          0.6082             0.5669      0.8454
Overall Best        DINO ViT-B/16    block0            0.8144          0.7446             0.6980      0.7995

Family-Specific Observations

1. CNNs: EfficientNet variants perform strongly, with B3 (CKA-RBF mean 0.7017) leading, followed closely by ResNet-50 (0.6902) and DenseNet-169 (0.6853). Best alignment typically occurs at later stages (e.g., EfficientNet-B3 stage5; ResNet-50 layer3), consistent with hierarchical convergence.

2. Supervised Transformers: ViT-L/16 (0.8057) outperforms Swin variants by a wide margin; best alignment arises at deeper transformer blocks (block14 for ViT-L/16; block8 for ViT-B/16).

3. Self-Supervised Transformers: DINO ViT-B/16 achieves the highest overall alignment (0.8144); DINOv2-Base is strong (0.7232), while DINOv3 pretrain variants show moderate alignment in our setting.

Model Performance Comparison

Figure: DINO models cluster in the upper-right region (high performance on both metrics), with DINO-ViT-B/16 achieving the highest RSA Spearman performance (0.698). CNN models form a tight cluster in the middle region, while Transformer models show high variability.

Layer-wise Dissimilarity Analysis

Beyond alignment, we systematically localize the layers with the strongest dissimilarity signals, identified by low CKA/RSA and high projected 1D Wasserstein distance. Three robust patterns emerge across families:

  • Lowest alignment concentrates early: Initial CNN convolutions (e.g., ResNet/conv1) and the earliest blocks in ViT/DINO variants tend to have the lowest CKA-Linear and RSA
  • Distributional shift can peak late: Deeper EfficientNet stages and late ViT blocks exhibit the largest Wasserstein shifts while still retaining moderate alignment
  • Self-supervised giants can decouple geometry and distribution: DINOv3 large models show very high late-block Wasserstein despite competitive CKA/RSA

Mantel permutation tests, MMD, and Energy distance frequently reject the null across layers, confirming measurable distributional differences even when alignment is high. Nevertheless, the leading models maintain robust cross-domain alignment by CKA and RSA, suggesting shape/semantic consistency despite domain shifts.
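For completeness, a minimal sketch of two of the distribution-shift measures, the projected (sliced) 1-Wasserstein distance and a biased RBF-kernel MMD estimate, is given below; the number of random projections and the median-heuristic bandwidth are illustrative defaults rather than the paper's settings.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import wasserstein_distance

def projected_wasserstein(X, Y, n_projections=64, seed=0):
    """Average 1D Wasserstein distance over random unit projections."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_projections):
        v = rng.normal(size=X.shape[1])
        v /= np.linalg.norm(v)
        dists.append(wasserstein_distance(X @ v, Y @ v))
    return float(np.mean(dists))

def mmd_rbf(X, Y, sigma=None):
    """Biased RBF-kernel MMD^2 estimate with a median-heuristic bandwidth."""
    Z = np.vstack([X, Y])
    d2 = cdist(Z, Z, "sqeuclidean")
    s = sigma if sigma is not None else np.sqrt(np.median(d2[d2 > 0]) / 2)
    K = np.exp(-d2 / (2 * s ** 2))
    n = len(X)
    kxx, kyy, kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return float(kxx.mean() + kyy.mean() - 2 * kxy.mean())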


BibTeX

@misc{shah2025purrturbedstablehumancatinvariant,
      title={Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs}, 
      author={Arya Shah and Vaibhav Tripathi},
      year={2025},
      eprint={2511.02404},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.02404}, 
}