Neuroscience Basis for Hearing & Vision
Neuroscience Basis for “Perceptual Quality Depends on Few Dimensions”
- Human vision and hearing are most sensitive to low-frequency, structural, and semantic features, while high-frequency perturbations and fine details are often ignored.
- In other words, only a small fraction of the total signal information determines our subjective judgment of "realism" and "clarity".
Because our sensory organs and early neural circuits preferentially encode low-to-mid-frequency, structural, and semantic features, while largely filtering out imperceptible high-frequency noise, generative models that concentrate their capacity on these perceptually relevant dimensions can achieve high subjective quality at far lower computational cost.
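As a toy illustration of this energy concentration, here is a minimal pure-Python sketch (the naive DFT and the example signal are illustrative, not from any specific system): a smooth "structural" signal with a faint high-frequency "texture" component stores almost all of its energy in the lowest few frequency bins.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)); fine for a small demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A "natural" signal: slow structural variation plus faint high-frequency texture.
n = 64
signal = [math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 20 * t / n)
          for t in range(n)]

energy = [abs(c) ** 2 for c in dft(signal)]
total = sum(energy)
# Energy in the lowest few bins (plus their conjugate mirrors at n - k).
low = energy[0] + sum(energy[k] + energy[n - k] for k in range(1, 4))
print(f"low-frequency share: {low / total:.4f}")  # ≈ 0.997
```

A codec or generative model that keeps only those few low-frequency bins would discard the texture component while preserving essentially all of the perceptually dominant structure.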
Example: Hybrid Approach
Many state-of-the-art systems embed prior knowledge from classical compression (such as the DCT and subband decomposition) into learnable networks. This retains the interpretability and stability of handcrafted design while letting machine learning fill in the details from data.
Prior + Learning
- Prior (handcrafted): provides a "good starting point", narrows the model's search space, and makes training more efficient and stable.
- Learning (data-driven): fine-tuning and extension within the prior's framework let the model adapt to the complexity of real data.
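A minimal sketch of this "prior + learning" split (the class and parameter names here are hypothetical): a fixed DCT-II basis supplies the handcrafted transform, while per-coefficient gains are the part left for learning.

```python
import math

def dct_basis(n):
    """Fixed DCT-II basis vectors: the handcrafted prior."""
    return [[math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)]
            for k in range(n)]

class DCTPriorLayer:
    """Hypothetical hybrid layer: a frozen DCT transform (prior) followed by
    learnable per-coefficient gains (the data-driven part)."""
    def __init__(self, n):
        self.basis = dct_basis(n)
        self.gain = [1.0] * n  # learnable parameters, initialized to identity

    def forward(self, x):
        # Project onto the fixed basis, then apply the learned scaling.
        coeffs = [sum(b * v for b, v in zip(row, x)) for row in self.basis]
        return [g * c for g, c in zip(self.gain, coeffs)]

layer = DCTPriorLayer(8)
out = layer.forward([1.0] * 8)  # a constant input excites only the DC coefficient
```

Because the transform itself never has to be learned, training only has to discover how much each frequency band matters, which is a far smaller search space.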
Latent vectors are particularly effective largely because human perception itself is multi-scale. In the audio domain:
1. Very fast amplitude changes are perceived as pitch
2. Slower changes are perceived as rhythm
In the visual field:
1. Rapid local fluctuations in color and intensity are perceived as texture
2. Large-scale layout (such as object outline and structure) is the key information we use to actually identify "objects"
If the latent vector automatically discards the high-frequency texture details that do not affect perception while retaining the low-frequency structure that determines an object's shape or a signal's rhythm, it both matches the human perceptual mechanism and greatly simplifies what the generative model must learn.
A good latent vector will abstract all the subtle variations of “grass texture” into a simple “grass present” label, while keeping key structural information like eyes intact. In this way, the generative model only needs to learn the distribution of “grass present or not present”, without having to simulate endless grass blade noise.
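A crude sketch of that abstraction (the function and thresholds are hypothetical, and variance is only a rough stand-in for "texture-ness"; a real encoder would also have to distinguish structured detail like eyes from noise):

```python
def encode_patch(patch, texture_threshold=1.0):
    """Toy latent abstraction (illustrative only): collapse high-variance
    patches to a single 'texture present' token; summarize low-variance
    patches with one value."""
    mean = sum(patch) / len(patch)
    var = sum((v - mean) ** 2 for v in patch) / len(patch)
    if var > texture_threshold:
        return ("texture", round(mean, 2))  # "grass present": details discarded
    return ("flat", round(mean, 2))         # e.g. sky: one value suffices

grass = [3.1, 7.4, 1.2, 9.8, 2.5, 8.0, 0.7, 6.3]  # noisy texture
sky = [120.0, 120.1, 119.9, 120.0]                 # nearly constant
```

Under this toy scheme the generative model only has to reproduce the token distribution ("grass present or not"), not the exact blade-level noise.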
Entropy in Perceptual Signals
For a random variable $X$ (e.g., a pixel intensity) with probability distribution $p(x)$, its entropy is defined as:
\[H(X) = - \sum_x p(x)\,\log_2 p(x)\]
- If a pixel's value is nearly constant (e.g., the uniform blue of the sky), then $H(X) \approx 0$.
- If a pixel's value is uniformly distributed over many colors, then $H(X)$ approaches its maximum, $\log_2(\text{number of possible values})$.
- In a photograph of a dog in a field:
- The sky region has low entropy (neighboring pixels are easy to predict).
- The grass texture has high entropy (neighboring pixels vary unpredictably).
```
# Pseudocode: route each region by its estimated entropy
if entropy < threshold_low:
    encode_as_constant_block(region)   # e.g. flat sky: one value suffices
elif entropy > threshold_high:
    abstract_as_texture_type(region)   # e.g. grass: store a texture label only
```
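The entropy estimate itself is easy to compute from a value histogram; a small runnable example (the region values below are made up for illustration):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), estimated from samples."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sky = [120] * 64                      # near-constant region
grass = [v % 16 for v in range(64)]   # 16 equally likely values

# entropy(sky) == 0.0  (constant -> fully predictable)
# entropy(grass) == 4.0  (uniform over 16 values -> log2 16 bits)
```

The two thresholds in the pseudocode above would then be picked somewhere between these extremes, e.g. below 1 bit for "constant" and above 3 bits for "texture".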
📍 SparseMAE & Contrastive-Sparse
```python
import torch

class GazeContrastiveSparse:
    def __init__(self, threshold=0.5):
        self.contrastive_sparse = ContrastiveSparse()
        self.gaze_augment = GazeAugmentation()
        self.threshold = threshold  # activation cutoff for sparse fixations

    def train_on_gaze_data(self, gaze_sequence):
        # Two augmented views of the same gaze sequence for contrastive learning
        aug1 = self.gaze_augment.temporal_jitter(gaze_sequence)
        aug2 = self.gaze_augment.spatial_noise(gaze_sequence)
        loss, feat1, feat2 = self.contrastive_sparse(aug1, aug2)
        key_fixations = self.extract_sparse_fixations(feat1)
        return loss, key_fixations

    def extract_sparse_fixations(self, features):
        # Keep only strongly activated features as candidate key fixations
        activation_mask = torch.abs(features) > self.threshold
        return torch.nonzero(activation_mask)
```
Flow Matching for Temporal Consistency
📍 Neuroscience Perception Modeling
- Biological inspiration from the human visual system:
  - Multi-scale perception (fovea vs. peripheral vision)
  - Sparsity of attention ("spotlight" attention)
  - Hierarchy of visual priorities
  - Coupling between eye movements and cognition
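The fovea-versus-periphery idea in the list above can be sketched as a toy sampler (the function name, radius, and block size are hypothetical choices): full resolution is kept near the gaze point, while peripheral pixels are replaced by 2x2 block averages.

```python
def foveated_sample(image, cx, cy, fovea_radius=2):
    """Hypothetical foveated sampler: exact pixels near the gaze point (fovea),
    2x2 block averages elsewhere (periphery)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            if abs(x - cx) <= fovea_radius and abs(y - cy) <= fovea_radius:
                row.append(image[y][x])  # foveal region: keep full resolution
            else:
                # Peripheral region: average the enclosing 2x2 block
                y0, x0 = (y // 2) * 2, (x // 2) * 2
                block = [image[yy][xx]
                         for yy in range(y0, min(y0 + 2, h))
                         for xx in range(x0, min(x0 + 2, w))]
                row.append(sum(block) / len(block))
        out.append(row)
    return out

image = [[float(x + y) for x in range(8)] for y in range(8)]
sampled = foveated_sample(image, cx=3, cy=3)
```

Like the biological system, this spends most of the representation budget where attention is directed and compresses everything else.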
```python
class PredictiveGazeInteraction:
    def predict_user_next_action(self, gaze_sequence):
        # Infer intent from the gaze trajectory, then anticipate the next action
        intention = self.extract_intention(gaze_sequence)
        next_action = self.predict_action(intention)
        self.prepare_for_action(next_action)  # e.g. prefetch the likely target
        return next_action
```
```python
class RealTimeEyeTracking:
    def __init__(self):
        self.latency_target = 1  # target end-to-end latency: 1 ms
        self.sparse_processor = UltraSparseProcessor()

    def process_gaze_realtime(self, eye_data):
        # Sparse features keep per-frame compute within the latency budget
        sparse_features = self.sparse_processor(eye_data)
        prediction = self.fast_predict(sparse_features)
        return prediction
```