Neuroscience Basis For Hearing & Vision
Neuroscience Basis for "Perceptual Quality Depends on Few Dimensions"

- Human vision and hearing are most sensitive to low-frequency, structural, and semantic features, while high-frequency perturbations and fine details are often ignored
- In other words, only a small part of the overall signal information determines our subjective judgment of "realism" and "clarity"
Since our sensory organs and early neural circuits preferentially encode low-to-mid frequency, structural, and semantic features, while largely filtering out imperceptible high-frequency noise, generative models that concentrate capacity on those perceptually relevant dimensions achieve high subjective quality with far less computational cost
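As a rough numerical illustration of this claim, a "natural-like" signal (smooth structure plus faint high-frequency noise) keeps almost all of its energy in a small number of low-frequency bins; the signal shape, bin counts, and noise level below are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: in a smooth signal with faint high-frequency noise,
# the lowest few spectral bins hold nearly all the energy, so discarding
# the high-frequency tail loses very little.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024, endpoint=False)
smooth = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
signal = smooth + 0.05 * rng.standard_normal(t.size)  # imperceptible noise

spectrum = np.fft.rfft(signal)
energy = np.abs(spectrum) ** 2
low_freq_share = energy[:32].sum() / energy.sum()  # lowest ~6% of bins
print(f"energy in lowest 32 of {energy.size} bins: {low_freq_share:.3f}")
```

Keeping only those 32 bins is a 16x reduction in spectral coefficients while preserving well over 90% of the energy here.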
For example:
The Generalization of Sampling
def generalized_sampling_theory():
    """Why downsampling works so well in deep learning."""
    # Classical Nyquist theorem: perfect reconstruction requires
    classical_nyquist = "f_sample >= 2 * f_max"
    # Deep learning's relaxed conditions
    dl_sampling = {
        "No need for perfect reconstruction",
        "Only need to preserve task-relevant information",
        "Tolerate controlled information loss",
        "Compensate loss through learning",
    }
    # Effective sampling rate:
    # f_effective = f_original * task_relevance_ratio
    # Usually task_relevance_ratio = 0.1-0.3
-> Task-oriented sampling is more efficient than signal-oriented sampling
The Essence of Downsampling
def downsampling_essence():
    fundamental_principles = {
        "Information_Theoretic_Foundation": {
            "principle": "Natural signals have high intrinsic redundancy",
            "quantification": "80-90% of information is redundant",
        },
        "Perceptual_Science": {
            "principle": "Human perceptual systems have limitations",
            "application": "Details beyond perceptual thresholds can be safely discarded",
        },
        "Statistical_Learning": {
            "principle": "High-dimensional data lies on low-dimensional manifolds",
            "implementation": "Encoders learn this dimensionality-reduction mapping",
        },
        "Task_Optimization": {
            "principle": "Only task-relevant information needs to be preserved",
            "benefits": "Improves generalization, reduces overfitting",
        },
        "Computational_Reality": {
            "principle": "Limited resources require tradeoffs",
            "solution": "Downsampling provides an efficiency-performance balance",
        },
    }
    return fundamental_principles
-> information reorganization and abstraction
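The redundancy principle above can be made concrete with a toy experiment: a smooth "natural" image survives 4x compression (2x downsampling per axis) with only a small reconstruction error. The image, pooling scheme, and sizes below are illustrative assumptions:

```python
import numpy as np

# Sketch of natural-signal redundancy: downsample a smooth 64x64 image
# by 2x per axis (4x fewer pixels), then upsample back and measure error.
x = np.linspace(0, 1, 64)
img = np.outer(np.sin(np.pi * x), np.cos(np.pi * x))  # smooth "natural" image

# 2x2 average pooling = crude downsampling to 32x32
pooled = img.reshape(32, 2, 32, 2).mean(axis=(1, 3))
# Nearest-neighbour upsampling back to 64x64
restored = pooled.repeat(2, axis=0).repeat(2, axis=1)

rel_error = np.abs(img - restored).mean() / np.abs(img).mean()
print(f"mean relative error after 4x compression: {rel_error:.4f}")
```

For a high-entropy noise image the same procedure would lose roughly half the signal variance, which is exactly the asymmetry that makes downsampling safe for structured content.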
Human Perceptual Limitations
# Human perceptual characteristics
perceptual_limits = {
    "Vision": {
        "temporal_resolution": "24-60 fps (anything beyond is redundant)",
        "spatial_resolution": "Huge difference between central and peripheral vision",
        "color_resolution": "Green > Red > Blue sensitivity",
    },
    "Audition": {
        "temporal_resolution": "2-3 ms (10 ms frames clearly oversample)",
        "frequency_resolution": "Logarithmic scale; poor resolution at high frequencies",
        "loudness_perception": "Logarithmic; subtle differences imperceptible",
    },
}
-> Encoders can safely discard imperceptible information
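The logarithmic loudness characteristic above is exactly what classical mu-law companding (8-bit telephone audio, ITU-T G.711) exploits: quantization steps follow the log scale, so error stays below perceptual thresholds where hearing is most sensitive. A minimal sketch (grid and bit depth are illustrative):

```python
import numpy as np

# Mu-law companding: quantize on a logarithmic amplitude scale so that
# quiet signals, where hearing is most sensitive, get the finest steps.
MU = 255.0

def mulaw_encode(x):
    # x in [-1, 1] -> companded value in [-1, 1]
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_decode(y):
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

x = np.linspace(-1, 1, 1001)
quantized = np.round(mulaw_encode(x) * 127) / 127  # 8-bit companded code
restored = mulaw_decode(quantized)
print(f"max abs round-trip error: {np.abs(x - restored).max():.4f}")
```

Near zero amplitude the round-trip error is orders of magnitude smaller than at full scale, mirroring the perceptual-limit table: the encoder discards precision exactly where we cannot hear it.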
Hybrid Approach
Many state-of-the-art systems embed prior knowledge from classical compression (such as the DCT and subband decomposition) into learnable networks, retaining the interpretability and stability of handcrafted design while using learning to fill in details from data
Prior + Learning

- Prior (handcrafted) - provides a "good starting point", narrows the model's search space, and makes training more efficient and stable
- Learning (data-driven) - fine-tunes or extends within the prior framework, letting the model adapt to the complexity of real data
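As a minimal sketch of this "prior + learning" split, the handcrafted prior could be a DCT-II basis used to initialize a learnable transform; the perturbation standing in for gradient updates below is an illustrative assumption, not any particular system's method:

```python
import numpy as np

# "Prior + learning" sketch: start a learnable analysis transform from the
# orthonormal DCT-II basis instead of random weights, then fine-tune.
def dct_basis(n):
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    basis[0] *= 1 / np.sqrt(2)
    return basis * np.sqrt(2 / n)  # orthonormal rows

W = dct_basis(8)  # handcrafted prior: interpretable, stable, energy-compacting
print(np.allclose(W @ W.T, np.eye(8)))  # orthonormality = a stable start

# Training would then adjust W from this point; the small perturbation
# below is a stand-in for learned updates within the prior's neighborhood.
W_learned = W + 0.01 * np.random.default_rng(0).standard_normal(W.shape)
```

Starting from an orthonormal, energy-compacting basis narrows the search space: early training already behaves like a classical codec, and learning only has to model what the prior misses.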
The reason latent vectors are particularly effective is largely that human perception itself is "multi-scale". In the audio domain:

1. Very fast amplitude changes are perceived as pitch
2. Slower changes are perceived as rhythm

In the visual domain:

1. Rapid local fluctuations in color and intensity are perceived as texture
2. Large-scale layout (such as object outlines and structure) carries the key information we use to identify "objects"

If the latent vector can automatically discard the high-frequency texture details that do not affect perception while retaining the low-frequency structure that determines an object's shape or a signal's rhythm, it not only matches the human perceptual mechanism but also greatly simplifies what the generative model needs to learn.

A good latent vector abstracts all the subtle variations of "grass texture" into a simple "grass present" label, while keeping key structural information, such as eyes, intact. The generative model then only needs to learn the distribution of "grass present or not present", without having to simulate endless blade-of-grass noise.
Entropy in Perceptual Signals
For a random variable $X$ (e.g., pixel intensity) with probability distribution $p(x)$, its entropy is defined as:

\[H(X) = - \sum_x p(x)\,\log_2 p(x)\]

- If a pixel's value is nearly constant (e.g., the uniform blue of the sky), then $H(X) \approx 0$.
- If a pixel's value is uniformly distributed over many colors, then $H(X)$ approaches its maximum, $\log_{2}(\text{number of possible values})$.
- In a photograph of a dog in a field:
  - The sky region has low entropy (neighboring pixels are easy to predict),
  - The grass texture has high entropy (neighboring pixels vary unpredictably).
if entropy < threshold_low:
    encode_as_constant_block(region)
elif entropy > threshold_high:
    abstract_as_texture_type(region)
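The routing above can be sketched end to end with an empirical-histogram entropy estimate; the thresholds, toy regions, and strategy names are illustrative assumptions:

```python
import numpy as np

# Estimate per-region entropy from a pixel histogram and route regions
# to coding strategies (thresholds and regions are illustrative).
def region_entropy(pixels, n_levels=256):
    hist = np.bincount(pixels.ravel(), minlength=n_levels)
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log(0) contributes nothing
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
sky = np.full((16, 16), 120, dtype=np.int64)  # nearly constant region
grass = rng.integers(0, 256, size=(16, 16))   # unpredictable texture

threshold_low, threshold_high = 1.0, 6.0
for name, region in [("sky", sky), ("grass", grass)]:
    h = region_entropy(region)
    if h < threshold_low:
        strategy = "encode_as_constant_block"
    elif h > threshold_high:
        strategy = "abstract_as_texture_type"
    else:
        strategy = "encode_directly"
    print(f"{name}: {h:.2f} bits -> {strategy}")
```

The constant sky region lands at 0 bits and is stored as a single value; the grass region lands near the 8-bit maximum and is abstracted as a texture type rather than encoded pixel by pixel.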
SparseMAE & Contrastive-Sparse
class GazeContrastiveSparse:
    def __init__(self):
        self.contrastive_sparse = ContrastiveSparse()
        self.gaze_augment = GazeAugmentation()

    def train_on_gaze_data(self, gaze_sequence):
        # Two augmented views of the same gaze sequence for contrastive learning
        aug1 = self.gaze_augment.temporal_jitter(gaze_sequence)
        aug2 = self.gaze_augment.spatial_noise(gaze_sequence)
        loss, feat1, feat2 = self.contrastive_sparse(aug1, aug2)
        key_fixations = self.extract_sparse_fixations(feat1)
        return loss, key_fixations

    def extract_sparse_fixations(self, features, threshold=0.5):
        # Keep only strongly activated features (sparse fixation events)
        activation_mask = torch.abs(features) > threshold
        sparse_indices = torch.nonzero(activation_mask)
        return sparse_indices
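The thresholding idea behind extract_sparse_fixations can be shown standalone; the toy features, threshold, and keep-salient-time-steps rule below are assumptions for illustration (numpy in place of torch to stay self-contained):

```python
import numpy as np

# Illustrative stand-in for sparse fixation extraction: keep only the time
# steps whose feature magnitude clears a threshold (fixation bursts),
# dropping low-activation drift between them.
def extract_sparse_fixations(features, threshold=0.5):
    mask = np.abs(features) > threshold
    return np.nonzero(mask.any(axis=-1))[0]  # indices of salient time steps

gaze_features = np.array([
    [0.1, 0.0],  # drift
    [0.9, 0.2],  # fixation burst
    [0.0, 0.1],  # drift
    [0.3, 0.8],  # fixation burst
])
print(extract_sparse_fixations(gaze_features))  # -> [1 3]
```

Only 2 of 4 time steps survive, which is the whole point: downstream modules operate on a sparse event stream rather than the full gaze trace.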
Flow Matching for Temporal Consistency
Neuroscience Perception Modeling
- Biological inspiration from the human visual system:
  - Multi-scale perception mechanism (fovea vs peripheral vision)
  - Sparsity of attention (spotlight attention)
  - Hierarchy of visual priorities
  - Coupling between eye movement and cognition
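The fovea-vs-periphery mechanism in the list above can be sketched as a toy foveated sampler; the window size, stride, and fixed central fixation point are illustrative assumptions:

```python
import numpy as np

# Foveated sampling sketch: keep full resolution in a central window
# (the "fovea"), aggressively subsample everywhere else (the periphery).
def foveated_sample(img, fovea=16, stride=4):
    h, w = img.shape
    cy, cx = h // 2, w // 2  # fixation point assumed at image center
    fovea_patch = img[cy - fovea // 2: cy + fovea // 2,
                      cx - fovea // 2: cx + fovea // 2]  # dense sampling
    periphery = img[::stride, ::stride]                  # sparse sampling
    return fovea_patch, periphery

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
fovea_patch, periphery = foveated_sample(img)
total = fovea_patch.size + periphery.size
print(f"kept {total} of {img.size} pixels ({total / img.size:.1%})")
```

Here roughly one eighth of the pixels are kept, yet the region under fixation loses nothing, which is the same budget split the retina makes.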
class PredictiveGazeInteraction:
    def predict_user_next_action(self, gaze_sequence):
        intention = self.extract_intention(gaze_sequence)
        next_action = self.predict_action(intention)
        self.prepare_for_action(next_action)
        return next_action

class RealTimeEyeTracking:
    def __init__(self):
        self.latency_target = 1  # target latency: 1 ms
        self.sparse_processor = UltraSparseProcessor()

    def process_gaze_realtime(self, eye_data):
        sparse_features = self.sparse_processor(eye_data)
        prediction = self.fast_predict(sparse_features)
        return prediction