CVPR-2023-Papers

CVPR 2023 Papers: Explore a comprehensive collection of cutting-edge research papers presented at CVPR 2023, the premier computer vision conference. Keep up to date with the latest advances in computer vision and deep learning. Code implementations included. Star ⭐ the repository to support the development of visual intelligence!

Explore the CVPR 2023 online conference list with a comprehensive collection of accepted papers. Access additional resources such as PDFs, Supplementary Material, arXiv links and BibTeX citations for in-depth exploration of the research presented.

Other collections of the best AI conferences

❗ The conference table is kept up to date.

Conference	Year
Computer Vision (CV)
ICCV	2023
Speech (SP)
ICASSP	2023
INTERSPEECH	2023

Contributors

Contributions to improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to create pull requests, open issues or contact me via email. Your participation is crucial to making this repository even better.

Papers

List of sections

3D from Multi-View and Sensors
Image and Video Synthesis and Generation
Humans: Face, Body, Pose, Gesture, Movement
Transfer, Meta, Low-Shot, Continual, or Long-Tail Learning
Recognition: Categorization, Detection, Retrieval
Vision, Language, and Reasoning
Low-Level Vision
Segmentation, Grouping and Shape Analysis
Deep Learning Architectures and Techniques
Multi-Modal Learning
3D from Single Images
Medical and Biological Vision, Cell Microscopy
Video: Action and Event Understanding
Autonomous Driving
Self-Supervised or Unsupervised Representation Learning
Datasets and Evaluation
Scene Analysis and Understanding
Adversarial Attack and Defense
Efficient and Scalable Vision
Computational Imaging
Video: Low-Level Analysis, Motion, and Tracking
Vision Applications and Systems
Vision and Graphics
Robotics
Transparency, Fairness, Accountability, Privacy, Ethics in Vision
Explainable Computer Vision
Embodied Vision: Active Agents, Simulation
Document Analysis and Understanding
Machine Learning (other than Deep Learning)
Physics-based Vision and Shape-from-X
Biometrics
Optimization Methods (other than Deep Learning)
Photogrammetry and Remote Sensing
Computer Vision Theory
Computer Vision for Social Good
Others

3D from Multi-View and Sensors

Title	Repo	Paper	Video
NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization
Object Pose Estimation with Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation
NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces with Arbitrary Topologies
NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction from Multi-View Images
Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections
Multi-View Azimuth Stereo via Tangent Space Consistency		➖
Instant Multi-View Head Capture through Learnable Registration
EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points
Iterative Geometry Encoding Volume for Stereo Matching		➖
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble
VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization
Neuralangelo: High-Fidelity Neural Surface Reconstruction
In-Hand 3D Object Scanning from an RGB Sequence
SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds
FAC: 3D Representation Learning via Foreground Aware Feature Contrast		➖
Neural Kernel Surface Reconstruction
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds	➖	➖
HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes with Iterative Intertwined Regularization
Multi-Space Neural Radiance Fields
MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences		➖
PVO: Panoptic Visual Odometry
Diffusion-SDF: Text-to-Shape via Voxelized Diffusion
Rotation-Invariant Transformer for Point Cloud Matching
HexPlane: A Fast Representation for Dynamic Scenes
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders		➖
Progressive Neighbor Consistency Mining for Correspondence Pruning
SCoDA: Domain Adaptive Shape Completion for Real Scans		➖
Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo
Level-S²fM: Structure from Motion on Neural Level Set of Implicit Surfaces
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
SUDS: Scalable Urban Dynamic Scenes
3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds
BAEFormer: Bi-Directional and Early Interaction Transformers for Bird's Eye View Semantic Segmentation	➖	➖
Dionysus: Recovering Scene Structures by Dividing into Semantic Pieces	➖	➖
LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes	➖	➖
Neural Kaleidoscopic Space Sculpting
Starting from Non-Parametric Networks for 3D Point Cloud Analysis
Panoptic Compositional Feature Field for Editable Scene Rendering with Network-Inferred Labels via Metric Learning	➖	➖
Robust Dynamic Radiance Fields
BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields
Consistent Direct Time-of-Flight Video Depth Super-Resolution
Patch-based 3D Natural Scene Generation from a Single Example
3D Video Loops from Asynchronous Input
UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View
Neural Scene Chronology
RUST: Latent Neural Scene Representations from Unposed Imagery		➖
Painting 3D Nature in 2D: View Synthesis of Natural Scenes from a Single Semantic Mask
F²-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories
VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud
REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos
MVImgNet: A Large-Scale Dataset of Multi-View Images		➖
Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography
GINA-3D: Learning to Generate Implicit Neural Assets in the Wild	➖
MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
DynIBaR: Neural Dynamic Image-based Rendering
IMP: Iterative Matching and Pose Estimation with Adaptive Pooling
Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation
NeAT: Learning Neural Implicit Surfaces with Arbitrary Topologies from Multi-View Images
ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision
Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection
NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion		➖
3D Registration with Maximal Cliques
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation
Progressive Spatio-Temporal Alignment for Efficient Event-based Motion Estimation
RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis	➖	➖
NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior
Spherical Transformer for LiDAR-based 3D Recognition
Progressively Optimized Local Radiance Fields for Robust View Synthesis
PersonNeRF: Personalized Reconstruction from Photo Collections
NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation
Representing Volumetric Videos as Dynamic MLP Maps
Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation
A Practical Stereo Depth System for Smart Glasses
Compressing Volumetric Radiance Fields to 1 MB
HyperReel: High-Fidelity 6-DoF Video with Ray-Conditioned Sampling
Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields
Command-Driven Articulated Object Understanding and Manipulation
SCADE: NeRFs from Space Carving with Ambiguity-Aware Depth Estimates
PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields
NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer
SegLoc: Learning Segmentation-based Representations for Privacy-Preserving Visual Localization
expOSE: Accurate Initialization-Free Projective Factorization using Exponential Regularization
Neural Vector Fields: Implicit Representation by Explicit Learning
Unsupervised Inference of Signed Distance Functions from Single Sparse Point Clouds without Learning Priors
Learning to Measure the Point Cloud Reconstruction Loss in a Representation Space
Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent with Learned Distance Functions
TensoIR: Tensorial Inverse Rendering
Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes
Frequency-Modulated Point Cloud Rendering with Easy Editing
VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking
RGBD2: Generative Scene Synthesis via Incremental View Inpainting using RGBD Diffusion Models
Multi-View Stereo Representation Revisit: Region-Aware MVSNet
AutoRecon: Automated 3D Object Discovery and Reconstruction
Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories
Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis
LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs
Learning 3D Scene Priors with 2D Supervision
NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds
NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation
Two-View Geometry Scoring without Correspondences
Deep Graph-based Spatial Consistency for Robust Non-rigid Point Cloud Registration
RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo
Depth Estimation from Camera Image and mmWave Radar Point Cloud
Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection
Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis
PATS: Patch Area Transportation with Subdivision for Local Feature Matching
Depth Estimation from Indoor Panoramas with Neural Scene Representation
Masked Representation Learning for Domain Generalized Stereo Matching
GANHead: Towards Generative Animatable Neural Head Avatars
Panoptic Lifting for 3D Scene Understanding with Neural Fields
Visual-Tactile Sensing for In-Hand Object Reconstruction
IterativePFN: True Iterative Point Cloud Filtering
Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment
GarmentTracking: Category-Level Garment Pose Tracking
Learning Transformation-Predictive Representations for Detection and Description of Local Features
Local Implicit Ray Function for Generalizable Radiance Field Representation
Grid-guided Neural Radiance Fields for Large Urban Scenes
EventNeRF: Neural Radiance Fields from a Single Colour Event Camera
Learning Optical Expansion from Scale Matching
Self-Supervised 3D Scene Flow Estimation Guided by Superpoints
Adaptive Annealing for Robust Geometric Estimation
SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes
PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes
High-Res Facial Appearance Capture from Polarized Smartphone Images
Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering
Fully Self-Supervised Depth Estimation from Defocus Clue
Adaptive Assignment for Geometry Aware Local Feature Matching
Efficient Second-Order Plane Adjustment
Learning Adaptive Dense Event Stereo from the Image Domain
FreeNeRF: Improving Few-Shot Neural Rendering with Free Frequency Regularization
SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory
Cross-guided Optimization of Radiance Fields with Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis
AeDet: Azimuth-Invariant Multi-View 3D Object Detection
Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields
DKM: Dense Kernelized Feature Matching for Geometry Estimation
DINER: Depth-Aware Image-based NEural Radiance fields
HGNet: Learning Hierarchical Geometry from Points, Edges, and Surfaces
Instant Volumetric Head Avatars
3D Line Mapping Revisited
Learning to Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes
ESLAM: Efficient Dense SLAM System based on Hybrid Representation of Signed Distance Fields
Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
SparsePose: Sparse-View Camera Pose Regression and Refinement
Controllable Mesh Generation through Sparse Latent Point Diffusion Models
ARO-Net: Learning Implicit Fields from Anchored Radial Observations
Semantic Ray: Learning a Generalizable Semantic Field with Cross-Reprojection Attention
Sphere-guided Training of Neural Implicit Surfaces
Finding Geometric Models by Clustering in the Consensus Space
NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions
Privacy-Preserving Representations are not Enough: Recovering Scene Content from Camera Poses
Robust Multiview Point Cloud Registration with Reliable Pose Graph Initialization and History Reweighting
Neural Part Priors: Learning to Optimize Part-based Object Completion in RGB-D Scans
Accelerated Coordinate Encoding: Learning to Relocalize in Minutes using RGB and Poses
Gated Stereo: Joint Depth Estimation from Gated and Wide-Baseline Active Stereo Cues
Revisiting Rotation Averaging: Uncertainties and Robust Losses
NeRF-Supervised Deep Stereo
POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo
vMAP: Vectorised Object Mapping for Neural Field SLAM
PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces
Learnable Skeleton-Aware 3D Point Cloud Sampling
ObjectMatch: Robust Registration using Canonical Object Correspondences
DiffRF: Rendering-guided 3D Radiance Field Diffusion
Learning a Depth Covariance Function
Viewpoint Equivariance for Multi-View 3D Object Detection
BlendFields: Few-Shot Example-Driven Facial Modeling
Implicit Surface Contrastive Clustering for LiDAR Point Clouds
Self-Supervised Super-Plane for Neural 3D Reconstruction
DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models
AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training
VisFusion: Visibility-Aware Online 3D Scene Reconstruction from Videos
Fast Monocular Scene Reconstruction with Global-Sparse Local-Dense Grids
Semi-Weakly Supervised Object Kinematic Motion Prediction
OmniVidar: Omnidirectional Depth Estimation from Multi-Fisheye Images
ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning
PointVector: A Vector Representation In Point Cloud Analysis
Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once
Learning Neural Duplex Radiance Fields for Real-Time View Synthesis
VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
CompletionFormer: Depth Completion with Convolutions and Vision Transformers
Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields
Collaboration Helps Camera Overtake LiDAR in 3D Detection
SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields
GeoMVSNet: Learning Multi-View Stereo with Geometry Perception
3D Shape Reconstruction of Semi-Transparent Worms
Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution
Virtual Occlusions through Implicit Depth
Neural Fields meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes
Building Rearticulable Models for Arbitrary 3D Objects from 4D Point Clouds
DynamicStereo: Consistent Dynamic Depth from Stereo Videos
Robust Outlier Rejection for 3D Registration with Variational Bayes
Meta Architecture for Point Cloud Analysis
DyLiN: Making Light Field Networks Dynamic
Domain Generalized Stereo Matching via Hierarchical Visual Transformation
Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting
LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles
Long-Term Visual Localization with Mobile Sensors
Revisiting the P3P Problem
I²-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs
WildLight: In-the-Wild Inverse Rendering with a Flashlight
SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence
Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction
NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction
PointClustering: Unsupervised Point Cloud Pre-Training using Transformation Invariance in Clustering
PermutoSDF: Fast Multi-View Reconstruction with Implicit Surfaces using Permutohedral Lattices
TriVol: Point Cloud Rendering via Triple Volumes
Towards Unbiased Volume Rendering of Neural Implicit Surfaces with Geometry Priors
Semi-Supervised Stereo-based 3D Object Detection via Cross-View Consensus
Self-Supervised Pre-Training with Masked Shape Prediction for 3D Scene Understanding
Octree Guided Unoriented Surface Reconstruction
Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View
Learning Neural Volumetric Representations of Dynamic Humans in Minutes
AnchorFormer: Point Cloud Completion from Discriminative Nodes
Transforming Radiance Field with Lipschitz Network for Photorealistic 3D Scene Stylization
GANmouflage: 3D Object Nondetection with Texture Fields
PEAL: Prior-embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration
NeRFLight: Fast and Light Neural Radiance Fields using a Shared Feature Grid
TMO: Textured Mesh Acquisition of Objects with a Mobile Device by using Differentiable Rendering
Generating Part-Aware Editable 3D Shapes without 3D Supervision
ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction
ORCa: Glossy Objects as Radiance-Field Cameras
NeFII: Inverse Rendering for Reflectance Decomposition with Near-Field Indirect Illumination
BEV-guided Multi-Modality Fusion for Driving Perception
K-Planes: Explicit Radiance Fields in Space, Time, and Appearance
RobustNeRF: Ignoring Distractors with Robust Losses
Unsupervised Deep Asymmetric Stereo Matching with Spatially-Adaptive Self-Similarity
ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer
Diffusion-based Signed Distance Fields for 3D Shape Generation
FeatureBooster: Boosting Feature Descriptors with a Lightweight Neural Network
Temporal Interpolation is All You Need for Dynamic Neural Radiance Fields
Neural Lens Modeling
Multi-View Reconstruction using Signed Ray Distance Functions (SRDF)
Masked Wavelet Representation for Compact Neural Radiance Fields
A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization
MACARONS: Mapping and Coverage Anticipation with RGB Online Self-Supervision
DualRefine: Self-Supervised Depth and Pose Estimation through Iterative Epipolar Sampling and Refinement Toward Equilibrium
CLIP²: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data
Semidefinite Relaxations for Robust Multiview Triangulation
High-Frequency Stereo Matching Network
CAP: Robust Point Cloud Classification via Semantic and Structural Modeling
Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM
Temporally Consistent Online Depth Estimation using Point-based Fusion
Learning Neural Parametric Head Models
PointConvFormer: Revenge of the Point-based Convolution
Four-View Geometry with Unknown Radial Distortion
Seeing through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container

Image and Video Synthesis and Generation

Title	Repo	Paper	Video
Towards Universal Fake Image Detectors that Generalize Across Generative Models
Implicit Diffusion Models for Continuous Super-Resolution
High-Fidelity Guided Image Synthesis with Latent Diffusion Models
DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields
Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit
Balanced Spherical Grid for Egocentric View Synthesis
SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Self-guided Diffusion Models
Multi-Concept Customization of Text-to-Image Diffusion
3D-Aware Conditional Image Synthesis
QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity
SceneComposer: Any-Level Semantic Image Synthesis
DiffCollage: Parallel Generation of Large Content with Diffusion Models
Putting People in Their Place: Affordance-Aware Human Insertion into Scenes
Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur
Binary Latent Diffusion
StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation
SeaThru-NeRF: Neural Radiance Fields in Scattering Media
PointAvatar: Deformable Point-based Head Avatars from Videos
3DAvatarGAN: Bridging Domains for Personalized Editable Avatars
Neural Preset for Color Style Transfer
Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning
DyNCA: Real-Time Dynamic Texture Synthesis using Neural Cellular Automata
Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation
HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
RiDDLE: Reversible and Diversified De-Identification with Latent Encryptor
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation
LipFormer: High-Fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning
Consistent View Synthesis with Pose-guided Diffusion Models
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator
Imagic: Text-based Real Image Editing with Diffusion Models
Large-Capacity and Flexible Video Steganography via Invertible Neural Network
Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis
Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image
CF-Font: Content Fusion for Few-Shot Font Generation
One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field
Unsupervised Domain Adaption with Pixel-Level Discriminator for Image-Aware Layout Generation
Diffusion Probabilistic Model Made Slim
Collaborative Diffusion for Multi-Modal Face Generation and Editing
High-Fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors
Network-Free, Unsupervised Semantic Segmentation with Synthetic Images
Visual Prompt Tuning for Generative Transfer Learning
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models to Learn Any Unseen Style
Catch Missing Details: Image Reconstruction with Frequency Augmented Variational Autoencoder
Towards Bridging the Performance Gaps of Joint Energy-based Models
GLeaD: Improving GANs with a Generator-Leading Task
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
SPARF: Neural Radiance Fields from Sparse and Noisy Poses
DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation
Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation
MaskSketch: Unpaired Structure-guided Masked Image Generation
Affordance Diffusion: Synthesizing Hand-Object Interactions
Interactive Cartoonization with Controllable Perceptual Factors
MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation
Paint by Example: Exemplar-based Image Editing with Diffusion Models
GLIGEN: Open-Set Grounded Text-to-Image Generation
L-CoIns: Language-based Colorization with Instance Awareness
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation
Evading DeepFake Detectors via Adversarial Statistical Consistency
GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling
GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning
Where is My Spot? Few-Shot Image Generation via Latent Subspace Optimization
Regularized Vector Quantization for Tokenized Image Synthesis
EDICT: Exact Diffusion Inversion via Coupled Transformations
Scaling up GANs for Text-to-Image Synthesis
Shape-Aware Text-Driven Layered Video Editing
A Unified Pyramid Recurrent Network for Video Frame Interpolation
TAPS3D: Text-guided 3D Textured Shape Generation from Pseudo Supervision
Fine-grained Face Swapping via Regional GAN Inversion
OTAvatar: One-Shot Talking Face Avatar with Controllable Tri-Plane Rendering
Deep Stereo Video Inpainting
StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer
Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models
Unsupervised Volumetric Animation
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
Progressive Disentangled Representation Learning for Fine-grained Controllable Talking Head Synthesis
CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer
DeepVecFont-v2: Exploiting Transformers to Synthesize Vector Fonts with Higher Quality
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization
SINE: Semantic-Driven Image-based NeRF Editing with Prior-guided Editing Field
Exploring Intra-Class Variation Factors with Learnable Cluster Prompts for Semi-Supervised Image Synthesis
Image Cropping with Spatial-Aware Feature and Rank Consistency
Picture that Sketch: Photorealistic Image Generation from Abstract Sketches
MonoHuman: Animatable Human Neural Field from Monocular Video
PixHt-Lab: Pixel Height based Light Effect Generation for Image Compositing
Neural Pixel Composition for 3D-4D View Synthesis from Multi-Views
SpaText: Spatio-Textual Representation for Controllable Image Generation
Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement
Video Probabilistic Diffusion Models in Projected Latent Space
Variational Distribution Learning for Unsupervised Text-to-Image Generation
Linking Garment with Person via Semantically Associated Landmarks for Virtual Try-On
UV Volumes for Real-Time Rendering of Editable Free-View Human Performance
Null-Text Inversion for Editing Real Images using Guided Diffusion Models
Polynomial Implicit Neural Representations for Large Diverse Datasets
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Conditional Image-to-Video Generation with Latent Flow Diffusion Models
Local 3D Editing via 3D Distillation of CLIP Knowledge
Private Image Generation with Dual-Purpose Auxiliary Classifier
MAGVIT: Masked Generative Video Transformer
Dimensionality-Varying Diffusion Process
VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
DATID-3D: Diversity-Preserved Domain Adaptation using Text-to-Image Diffusion for 3D Generative Model
Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint
High-Fidelity and Freely Controllable Talking Head Video Generation
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields
MOSO: Decomposing MOtion, Scene and Object for Video Prediction
Multi Domain Learning for Motion Magnification
GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields
Hierarchical B-frame Video Coding using Two-Layer CANF without Motion Coding
Blemish-Aware and Progressive Face Retouching with Limited Paired Data
Text-guided Unsupervised Latent Transformation for Multi-attribute Image Manipulation
NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models
Fix the Noise: Disentangling Source Feature for Controllable Domain Translation
Class-Balancing Diffusion Models
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
Inversion-based Style Transfer with Diffusion Models
Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model
FlowGrad: Controlling the Output of Generative ODEs with Gradients
Graph Transformer GANs for Graph-Constrained House Generation
Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer
Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars
Ham2Pose: Animating Sign Language Notation into Pose Sequences
Neural Transformation Fields for Arbitrary-Styled Font Generation
LayoutDM: Transformer-based Diffusion Model for Layout Generation
Removing Objects from Neural Radiance Fields
Person Image Synthesis via Denoising Diffusion Model
AdaptiveMix: Improving GAN Training via Feature Space Shrinkage
Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator
3D Neural Field Generation using Triplane Diffusion
OmniAvatar: Geometry-guided Controllable 3D Head Synthesis
RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-ray Security Image Synthesis
ObjectStitch: Object Compositing with Diffusion Model
Persistent Nature: A Generative Model of Unbounded 3D Worlds
Masked and Adaptive Transformer for Exemplar based Image Translation
Spider GAN: Leveraging Friendly Neighbors to Accelerate GAN Training
Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
All are Worth Words: A ViT Backbone for Diffusion Models
Few-Shot Semantic Image Synthesis with Class Affinity Transfer
Blowing in the Wind: CycleNet for Human Cinemagraphs from Still Images
StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis
MixNeRF: Modeling a Ray with Mixture Density for Novel View Synthesis from Sparse Inputs
MoStGAN-V: Video Generation with Temporal Motion Styles
Frame Interpolation Transformer and Uncertainty Guidance
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
HOLODIFFUSION: Training a 3D Diffusion Model using 2D Images
Neural Texture Synthesis with Guided Correspondence
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360°
InstructPix2Pix: Learning to Follow Image Editing Instructions
Unpaired Image-to-Image Translation with Shortest Path Regularization
Freestyle Layout-to-Image Synthesis
On Distillation of Guided Diffusion Models
Single Image Backdoor Inversion via Robust Smoothed Classifiers
Make-a-Story: Visual Memory Conditioned Consistent Story Generation
Towards Practical Plug-and-Play Diffusion Models
Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis
Wavelet Diffusion Models are Fast and Scalable Image Generators
3D GAN Inversion with Facial Symmetry Prior
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
Video Compression with Entropy-Constrained Neural Representations
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
Sequential Training of GANs Against GAN-classifiers Reveals Correlated "Knowledge Gaps" Present among Independently Trained GAN Instances
Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization
Shifted Diffusion for Text-to-Image Generation
HandsOff: Labeled Dataset Generation with no Additional Human Annotations
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
Imagen Editor and EditBench: Advancing and Evaluating Text-guided Image Inpainting
Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration
BBDM: Image-to-Image Translation with Brownian Bridge Diffusion Models
VectorFusion: Text-to-SVG by Abstracting Pixel-based Diffusion Models

Humans: Face, Body, Pose, Gesture, Movement

Title	Repo	Paper	Video
Micron-BERT: BERT-based Facial Micro-Expression Recognition
NIKI: Neural Inverse Kinematics with Invertible Neural Networks for 3D Human Pose and Shape Estimation
A Characteristic Function-based Method for Bottom-Up Human Pose Estimation
Executing your Commands via Motion Diffusion in Latent Space
MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation
Dynamic Aggregated Network for Gait Recognition
Object Pop-Up: Can We Infer 3D Objects and Their Poses from Human Interactions Alone?
Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction
ECON: Explicit Clothed humans Optimized via Normal integration
Neuron Structure Modeling for Generalizable Remote Physiological Measurement
Continuous Sign Language Recognition with Correlation Network
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation
3D Human Mesh Estimation from Virtual Markers
3D Human Pose Estimation via Intuitive Physics
ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
Generating Holistic 3D Human Motion from Speech
HARP: Personalized Hand Reconstruction from a Monocular RGB Video
Learning Locally Editable Virtual Humans
Reconstructing Signing Avatars from Video using Linguistic Priors
DrapeNet: Garment Generation and Self-Supervised Draping
X-Avatar: Expressive Human Avatars
Hi4D: 4D Instance Segmentation of Close Human Interaction
Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposition
CloSET: Modeling Clothed Humans on Continuous Surface with Explicit Template Decomposition
Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images
Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition
HandNeRF: Neural Radiance Fields for Animatable Interacting Hands
Relightable Neural Human Assets from Multi-View Gradient Illuminations
Being Comes from Not-being: Open-Vocabulary Text-to-Motion Generation with Wordless Training
DeFeeNet: Consecutive 3D Human Motion Prediction with Deviation Feedback
BioNet: A Biologically-Inspired Network for Face Recognition
Boosting Detection in Crowd Analysis via Underutilized Output Features
Learning Analytical Posterior Probability for Human Mesh Recovery
Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals
Detecting and Grounding Multi-Modal Media Manipulation
RelightableHands: Efficient Neural Relighting of Articulated Hand Models
MEGANE: Morphable Eyeglass and Avatar Network
SunStage: Portrait Reconstruction and Relighting using the Sun as a Light Stage
TryOnDiffusion: A Tale of Two UNets
Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination
POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery
Scene-Aware Egocentric 3D Human Pose Estimation
PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation with Progressive Video Transformers
Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image
TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments
Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry
Generating Human Motion from Textual Descriptions with Discrete Representations
Learning Human Mesh Recovery in 3D Scenes
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction
3D-Aware Face Swapping
Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos
GFPose: Learning 3D Human Pose Prior with Gradient Fields
Rethinking Feature-based Knowledge Distillation for Face Recognition
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization
Ego-Body Pose Estimation via Ego-Head Pose Estimation
TOPLight: Lightweight Neural Networks with Task-Oriented Pretraining for Visible-Infrared Recognition
StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping
Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues
FLEX: Full-Body Grasping without Full-Body Grasps
EDGE: Editable Dance Generation From Music
Complete 3D Human Reconstruction from a Single Incomplete Image
Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters
Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild
CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose
Invertible Neural Skinning
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
Harmonious Feature Learning for Interactive Hand-Object Pose Estimation
Leapfrog Diffusion Model for Stochastic Trajectory Prediction
NeuFace: Realistic 3D Neural Face Rendering from Multi-View Images
DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion
GFIE: A Dataset and Baseline for Gaze-Following from 2D to 3D in Indoor Environments
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction
Human Pose as Compositional Tokens
Normal-guided Garment UV Prediction for Human Re-Texturing
Dynamic Graph Learning with Content-guided Spatial-Frequency Relation Reasoning for Deepfake Detection
VGFlow: Visibility Guided Flow Network for Human Reposing
Mutual Information-based Temporal Difference Learning for Human Pose Estimation in Video
PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image
HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation
Implicit Identity Driven Deepfake Face Swapping Detection
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data
SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments
Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation
UDE: A Unified Driving Engine for Human Motion Generation
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module
Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics
ACR: Attention Collaboration-based Regressor for Arbitrary Two-Hand Reconstruction
HumanBench: Towards General Human-Centric Perception with Projector Assisted Pretraining
CIMI4D: A Large Multimodal Climbing Motion Dataset under Human-Scene Interactions
Human Pose Estimation in Extremely Low-Light Conditions
DistilPose: Tokenized Pose Regression with Heatmap Distillation
Human Body Shape Completion with Implicit Shape and Flow Learning
Source-Free Adaptive Gaze Estimation by Uncertainty Reduction
Music-Driven Group Choreography
Robust Model-based Face Reconstruction through Weakly-Supervised Outlier Segmentation
MARLIN: Masked Autoencoder for Facial Video Representation LearnINg
Transformer-based Unified Recognition of Two Hands Manipulating Objects
Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization
ScarceNet: Animal Pose Estimation with Scarce Annotations
FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction
MoDi: Unconditional Motion Synthesis from Diverse Data
Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition
MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction
Stimulus Verification is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction
TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers
Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model
CIRCLE: Capture in Rich Contextual Environments
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Implicit Neural Head Synthesis via Controllable Local Deformation Fields
Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe based Motion Interpolation
JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking
STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection
GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-View Images
Decoupled Multimodal Distilling for Emotion Recognition
HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions
ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection
QPGesture: Quantization-based and Phase-guided Motion Matching for Natural Speech-Driven Gesture Generation
Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion
Probabilistic Knowledge Distillation of Face Ensembles
Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing
Parameter Efficient Local Implicit Image Function Network for Face Segmentation
HumanGen: Generating Human Radiance Fields with Explicit Priors
Biomechanics-guided Facial Action Unit Detection through Force Modeling
Decoupling Human and Camera Motion from Videos in the Wild
Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction
Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions from Monocular RGBD Stream
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation
Analyzing and Diagnosing Pose Estimation with Attributions
Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning
Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification
Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
Local Connectivity-based Density Estimation for Face Clustering
SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition
Detecting Human-Object Contact in Images
Controllable Light Diffusion for Portraits
InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds
NeMo: 3D Neural Motion Fields from Multiple Video Instances of the Same Action
Privacy-Preserving Adversarial Facial Features
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment
Clothed Human Performance Capture with a Double-Layer Neural Radiance Fields
Continuous Landmark Detection with 3D Queries
Learning a 3D Morphable Face Reflectance Model from Low-Cost Data
AUNet: Learning Relations between Action Units for Face Forgery Detection
3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention
Implicit 3D Human Mesh Recovery using Consistency with Pose and Shape from Unseen-View
3D Human Keypoints Estimation from Point Clouds in the Wild without Human Labels
Multi-Label Compound Expression Recognition: C-EXPR Database & Network
FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans from Sparse Views
Two-Stage Co-Segmentation Network based on Discriminative Representation for Recovering Human Mesh from Videos
Co-Speech Gesture Synthesis by Reinforcement Learning with Contrastive Pre-trained Rewards
FeatER: An Efficient Network for Human Reconstruction via Feature Map-based TransformER

Transfer, Meta, Low-Shot, Continual, or Long-Tail Learning

Title	Repo	Paper	Video
Dynamically Instance-guided Adaptation: A Backward-free Approach for Test-Time Domain Adaptive Semantic Segmentation
DETR with Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection
Mind the Label Shift of Augmentation-based Graph OOD Generalization
Long-Tailed Visual Recognition via Self-Heterogeneous Integration with Knowledge Excavation
Understanding and Improving Visual Prompting: A Label-Mapping Perspective
A Whac-A-Mole Dilemma: Shortcuts Come in Multiples where Mitigating One Amplifies Others
Improved Distribution Matching for Dataset Condensation
Divide and Adapt: Active Domain Adaptation via Customized Learning
Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation
Diversity-Aware Meta Visual Prompting
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection
Zero-Shot Object Counting
Learning with Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning
Distribution Shift Inversion for Out-of-Distribution Prediction
Endpoints Weight Fusion for Class Incremental Semantic Segmentation
Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization
Class-Conditional Sharpness-Aware Minimization for Deep Long-tailed Recognition
Meta-Causal Learning for Single Domain Generalization
VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval
Learning Imbalanced Data with Vision Transformers
Sharpness-Aware Gradient Matching for Domain Generalization
Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Regularizing Second-Order Influences for Continual Learning
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding
Dense Network Expansion for Class Incremental Learning
Batch Model Consolidation: A Multi-Task Model Consolidation Framework
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection
Supervised Masked Knowledge Distillation for Few-Shot Transformers
ALOFT: A Lightweight MLP-Like Architecture with Dynamic Low-Frequency Transform for Domain Generalization
ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation
DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation
Adjustment and Alignment for Unbiased Open Set Domain Adaptation
Adapting Shortcut with Normalizing Flow: An Efficient Tuning Framework for Visual Recognition
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Generalizing Dataset Distillation via Deep Generative Prior
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment
Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference
DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer
Bilateral Memory Consolidation for Continual Learning
Texts as Images in Prompt Tuning for Multi-Label Image Recognition
Learning Transformations To Reduce the Geometric Shift in Object Detection
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Transfer Knowledge from Head to Tail: Uncertainty Calibration under Long-tailed Distribution
Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning
DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models
Open-Set Likelihood Maximization for Few-Shot Learning
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation
Federated Domain Generalization with Generalization Adjustment
ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning
DA-DETR: Domain Adaptive Detection Transformer with Information Fusion
Harmonious Teacher for Cross-Domain Object Detection
AutoLabel: CLIP-based Framework for Open-Set Video Domain Adaptation
Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning
Revisiting Prototypical Network for Cross Domain Few-Shot Learning
Federated Incremental Semantic Segmentation
Semantic Prompt for Few-Shot Image Recognition
Rethinking Gradient Projection Continual Learning: Stability/Plasticity Feature Space Decoupling
No One Left Behind: Improving the Worst Categories in Long-Tailed Learning
Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn
Transductive Few-Shot Learning with Prototype-based Label Propagation by Iterative Graph Refinement
COT: Unsupervised Domain Adaptation with Clustering and Optimal Transport
Semi-Supervised Domain Adaptation with Source Label Adaptation
MetaMix: Towards Corruption-Robust Continual Learning with Temporally Self-Adaptive Data Transformation
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization
Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery
Real-Time Evaluation in Online Continual Learning: A New Hope
Partial Network Cloning
Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning
EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization
Feature Alignment and Uniformity for Test Time Adaptation
Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery
Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need
Balanced Product of Calibrated Experts for Long-Tailed Recognition
Unsupervised Continual Semantic Adaptation through Neural Rendering
Computationally Budgeted Continual Learning: What Does Matter?
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
Ground-Truth Free Meta-Learning for Deep Compressive Sampling
Multi-Level Logit Distillation
StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
On the Stability-Plasticity Dilemma of Class-Incremental Learning
TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation
MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation
CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection
Adaptive Plasticity Improvement for Continual Learning
Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning
Few-Shot Geometry-Aware Keypoint Localization
Spatio-Temporal Pixel-Level Contrastive Learning-based Source-Free Domain Adaptation for Video Semantic Segmentation
Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
Bi-Level Meta-Learning for Few-Shot Domain Generalization
Few-Shot Referring Relationships in Videos
Exploring Data Geometry for Continual Learning
Masked Images Are Counterfactual Samples for Robust Fine-Tuning
DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning
CoMFormer: Continual Learning in Semantic and Panoptic Segmentation
Global and Local Mixture Consistency Cumulative Learning for Long-tailed Visual Recognitions
Class Attention Transfer based Knowledge Distillation
Hard Sample Matters a Lot in Zero-Shot Quantization
Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption
SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail
Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning
Preserving Linear Separability in Continual Learning by Backward Feature Projection
Upcycling Models under Domain and Category Shift
Class-Incremental Exemplar Compression for Class-Incremental Learning
Learning Conditional Attributes for Compositional Zero-Shot Learning
BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning
NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs
Semi-Supervised Learning Made Simple with Self-Supervised Clustering
Guiding Pseudo-Labels with Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation
PCR: Proxy-based Contrastive Replay for Online Class-Incremental Continual Learning
Modality-Agnostic Debiasing for Single Domain Generalization
Robust Mean Teacher for Continual and Gradual Test-Time Adaptation
Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation
Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning with Hyperspherical Embeddings
Robust Test-Time Adaptation in Dynamic Scenarios
Source-Free Video Domain Adaptation with Spatial-Temporal-Historical Consistency Learning
Heterogeneous Continual Learning
Continual Detection Transformer for Incremental Object Detection
NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging
ViewNet: A Novel Projection-based Backbone with View Pooling for Few-Shot Point Cloud Classification
C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation
Train/Test-Time Adaptation with Retrieval
Dealing with Cross-Task Class Discrimination in Online Continual Learning
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning
Decoupling Learning and Remembering: A Bilevel Memory Framework with Knowledge Projection for Task-Incremental Learning
Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation
TIPI: Test Time Adaptation with Transformation Invariance
Meta-Learning with a Geometry-Adaptive Preconditioner
Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection
A Probabilistic Framework for Lifelong Test-Time Adaptation
Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation
CafeBoost: Causal Feature Boost to Eliminate Task-Induced Bias for Class Incremental Learning
A Strong Baseline for Generalized Few-Shot Semantic Segmentation
Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning
Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions
Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint
(ML)²P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models
Simulated Annealing in Early Layers Leads to Better Generalization
A Data-based Perspective on Transfer Learning
Learning Expressive Prompting with Residuals for Vision Transformers
Boosting Transductive Few-Shot Fine-Tuning with Margin-based Uncertainty Weighting and Probability Regularization
Improving Generalization with Domain Convex Game
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
Guided Recommendation for Model Fine-Tuning
Improving Generalization of Meta-Learning with Inverted Regularization at Inner-Level
Hint-Aug: Drawing Hints from Foundation Vision Transformers towards Boosted Few-Shot Parameter-Efficient Tuning

Recognition: Categorization, Detection, Retrieval

Title	Repo	Paper	Video
R²Former: Unified Retrieval and Reranking Transformer for Place Recognition
Mask-Free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations
StructVPR: Distill Structural Knowledge with Weighting Samples for Visual Place Recognition
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
One-to-Few Label Assignment for End-to-End Dense Detection
Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
Semi-DETR: Semi-Supervised Object Detection with Detection Transformers
Universal Instance Perception as Object Discovery and Retrieval
CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection
Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection
FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection
Box-Level Active Detection
Learning with Noisy Labels via Self-Supervised Adversarial Noisy Masking
Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection
Aligning Bag of Regions for Open-Vocabulary Object Detection
Asymmetric Feature Fusion for Image Retrieval
3D Video Object Detection with Learnable Object-Centric Global Optimization
Enhanced Training of Query-based Object Detection via Selective Query Recollection
Dense Distinct Query for End-to-End Object Detection
On-the-Fly Category Discovery
ProD: Prompting-to-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification
Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
SAP-DETR: Bridging the Gap between Salient Points and Queries-based Transformer Detector for Fast Model Convergency
An Erudite Fine-grained Visual Classification Model
Self-Supervised Implicit Glyph Attention for Text Recognition
Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning
Fake It Till You Make It: Learning Transferable Representations from Synthetic ImageNet Clones
FFF: Fragment-guided Flexible Fitting for Building Complete Protein Structures
Revisiting Self-Similarity: Structural Embedding for Image Retrieval
Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-based Action Recognition
MixTeacher: Mining Promising Labels with Mixed Scale Teacher for Semi-Supervised Object Detection
Learning Attention as Disentangler for Compositional Zero-Shot Learning
Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
SOOD: Towards Semi-Supervised Oriented Object Detection
Bias-Eliminating Augmentation Learning for Debiased Federated Learning
Towards Efficient use of Multi-Scale Features in Transformer-based Object Detectors
AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching
Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection
Disentangled Representation Learning for Unsupervised Neural Quantization
YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors
Virtual Sparse Convolution for Multimodal 3D Object Detection
TranSG: Transformer-based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification
Adaptive Sparse Pairwise Loss for Object Re-Identification
Multi-Granularity Archaeological Dating of Chinese Bronze Dings based on a Knowledge-guided Relation Graph
Event-guided Person Re-Identification via Sparse-Dense Complementary Learning
Vector Quantization with Self-Attention for Quality-Independent Representation Learning
Siamese Image Modeling for Self-Supervised Vision Representation Learning
FCC: Feature Clusters Compression for Long-tailed Visual Recognition
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
Soft Augmentation for Image Classification
Correspondence Transformers with Asymmetric Feature Learning and Matching Flow Super-Resolution
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning
Glocal Energy-based Learning for Few-Shot Open-Set Recognition
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Deep Factorized Metric Learning
Learning to Detect and Segment for Open Vocabulary Object Detection
ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
Photo Pre-Training, But for Sketch
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Detecting Everything in the Open World: Towards Universal Object Detection
Twin Contrastive Learning with Noisy Labels
Feature Aggregated Queries for Transformer-based Video Object Detectors
Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection
Deep Hashing with Minimal-Distance-Separated Hash Centers
Knowledge Combination to Learn Rotated Detection without Rotated Annotation
Good is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification
Discriminating Known from Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder
2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection
LINe: Out-of-Distribution Detection by Leveraging Important Neurons
Progressive Transformation Learning for Leveraging Virtual Images in Training
Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection
Decoupling MaxLogit for Out-of-Distribution Detection
Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
D²Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining
Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection with Single Point Supervision
Generalized UAV Object Detection via Frequency Domain Disentanglement
Deep Frequency Filtering for Domain Generalization
Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images
Improved Test-Time Adaptation for Domain Generalization
Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation
Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision
DETRs with Hybrid Matching
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
Clothing-Change Feature Augmentation for Person Re-Identification
Learning Attribute and Class-Specific Representation Duet for Fine-grained Fashion Analysis
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection
DynamicDet: A Unified Dynamic Architecture for Object Detection
Switchable Representation Learning Framework with Self-Compatibility
DATE: Domain Adaptive Product Seeker for E-Commerce
PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery
Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies
OvarNet: Towards Open-Vocabulary Object Attribute Recognition
HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models
Learning from Noisy Labels with Decoupled Meta Label Purifier
A Light Touch Approach to Teaching Transformers Multi-View Geometry
OpenMix: Exploring Outlier Samples for Misclassification Detection
Revisiting Reverse Distillation for Anomaly Detection
PROB: Probabilistic Objectness for Open World Object Detection
Equiangular Basis Vectors
Weakly Supervised Posture Mining for Fine-grained Classification
An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity
Weak-Shot Object Detection through Mutual Knowledge Transfer
Zero-Shot Everything Sketch-based Image Retrieval, and in Explainable Style
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels
Learning Partial Correlation based Deep Visual Representation for Image Classification
Boundary-aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval
PHA: Patch-Wise High-Frequency Augmentation for Transformer-based Person Re-Identification
Unknown Sniffer for Object Detection: Don't Turn a Blind Eye to Unknown Objects
BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation
Annealing-based Label-Transfer Learning for Open World Object Detection
Diversity-Measurable Anomaly Detection
Recurrent Vision Transformers for Object Detection with Event Cameras
AShapeFormer: Semantics-guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers
Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate
Contrastive Mean Teacher for Domain Adaptive Object Detectors
Bridging the Gap between Model Explanations in Partially Annotated Multi-Label Classification
PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person Re-Identification
BiasAdv: Bias-Adversarial Augmentation for Model Debiasing
ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection
Robust 3D Shape Classification via Non-Local Graph Attention Network
Two-Way Multi-Label Loss
Normalizing Flow based Feature Synthesis for Outlier-Aware Object Detection
Object Detection with Self-Supervised Scene Adaptation
Data-Efficient Large Scale Place Recognition with Graded Similarity Supervision
Generating Features with Increased Crop-related Diversity for Few-Shot Object Detection
Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants with no False Negatives and no False Positives
Deep Semi-Supervised Metric Learning with Mixed Label Propagation
Fine-grained Classification with Noisy Labels

Vision, Language, and Reasoning

Title	Repo	Paper	Video
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Iterative Proposal Refinement for Weakly-Supervised Video Grounding
MetaCLUE: Towards Comprehensive Visual Metaphors Research
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
GeneCIS: A Benchmark for General Conditional Image Similarity
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Generative Bias for Robust Visual Question Answering
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Gloss Attention for Gloss-Free Sign Language Translation
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Generalized Decoding for Pixel, Image, and Language
Accelerating Vision-Language Pretraining with Free Language Modeling
GRES: Generalized Referring Expression Segmentation
BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration
RGB no more: Minimally-decoded JPEG Vision Transformers
Scaling Language-Image Pre-Training via Masking
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension
Mobile User Interface Element Detection Via Adaptively Prompt Tuning
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation
Meta Compositional Referring Expression Segmentation
VindLU: A Recipe for Effective Video-and-Language Pretraining
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Learning Customized Visual Models with Retrieval-Augmented Knowledge
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations
Clover: Towards a Unified Video-Language Alignment and Fusion Model
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval
Task Residual for Tuning Vision-Language Models
Dream3D: Zero-Shot Text-to-3D Synthesis using 3D Shape Prior and Text-to-Image Diffusion Models
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
Visual Programming: Compositional Visual Reasoning without Training
Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language
Referring Multi-Object Tracking
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
Learning to Segment Every Referring Object Point by Point
Contrastive Grouping with Transformer for Referring Image Segmentation
Prototype-based Embedding Network for Scene Graph Generation
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding
S³C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
Cap4Video: What can Auxiliary Captions do for Text-Video Retrieval?
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Zero-Shot Referring Image Segmentation with Global-Local Context Features
Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training
Probabilistic Prompt Learning for Dense Prediction
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment
All in One: Exploring Unified Video-Language Pre-Training
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning
ConZIC: Controllable Zero-Shot Image Captioning by Sampling-based Polishing
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
ANetQA: A Large-Scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Multi-Modal Representation Learning with Text-Driven Soft Masks
Meta-Personalizing Vision-Language Models to Find Named Instances in Video
ReCo: Region-Controlled Text-to-Image Generation
Are Deep Neural Networks SMARTer than Second Graders?
Graph Representation for Order-Aware Visual Transformation
3D Concept Learning and Reasoning from Multi-View Images
Text with Knowledge Graph Augmented Transformer for Video Captioning
Crossing the Gap: Domain Generalization for Image Captioning
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
VQACL: A Novel Visual Question Answering Continual Learning Setting
Improving Selective Visual Question Answering by Learning from Your Peers
High-Fidelity 3D Face Generation from Natural Language Descriptions
Language-guided Audio-Visual Source Separation via Trimodal Consistency
Test of Time: Instilling Video-Language Models with a Sense of Time
Learning Situation Hyper-Graphs for Video Question Answering
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
Fine-grained Audible Video Description
Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering
A-Cap: Anticipation Captioning with Commonsense Knowledge
Cross-Domain Image Captioning with Discriminative Finetuning
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
The Dialog Must Go on: Improving Visual Dialog via Generative Self-Training
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding
Language Adaptive Weight Generation for Multi-Task Visual Grounding
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
A Simple Framework for Text-Supervised Semantic Segmentation
Learning to Name Classes for Vision and Language Models
Iterative Vision-and-Language Navigation
Behavioral Analysis of Vision-and-Language Navigation Agents
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
SynthVSR: Scaling Up Visual Speech Recognition with Synthetic Supervision
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens
Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning
Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices
Hierarchical Prompt Learning for Multi-Task Learning
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
SViTT: Temporal Learning of Sparse Video-Text Transformers
How You Feelin'? Learning Emotions and Mental States in Movie Scenes
Logical Implications for Visual Question Answering Consistency
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-to-Fine Contrastive Ranking
iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition
Semantic-Conditional Diffusion Networks for Image Captioning
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
RMLVQA: A Margin Loss Approach for Visual Question Answering with Language Biases
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Prefix Conditioning Unifies Language and Label Supervision
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models
Hierarchical Video-Moment Retrieval and Step-Captioning

Low-Level Vision

Title	Repo	Paper	Video
Activating More Pixels in Image Super-Resolution Transformer
MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding from Object Detection
Omni Aggregation Networks for Lightweight Image Super-Resolution
Blur Interpolation Transformer for Real-World Motion from Blur
Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution
Masked Image Training for Generalizable Deep Image Denoising
CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending
Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement
Learning A Sparse Transformer Network for Effective Image Deraining
Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
Self-Supervised Non-Uniform Kernel Estimation with Flow-based Motion Prior for Blind Image Deblurring
OSRT: Omnidirectional Image Super-Resolution with Distortion-Aware Transformer
Toward Accurate Post-Training Quantization for Image Super Resolution
Learning a Simple Low-Light Image Enhancer from Paired Low-Light Instances
Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction
Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution
Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow
PyramidFlow: High-Resolution Defect Contrastive Localization using Pyramid Normalizing Flow
DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration
DNF: Decouple and Feedback Network for Seeing in the Dark
Optimization-Inspired Cross-Attention Transformer for Compressive Sensing
Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution
Event-based Frame Interpolation with Ad-hoc Deblurring
Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution
SMAE: Few-Shot Learning for HDR Deghosting with Saturation-Aware Masked Autoencoders
A Unified HDR Imaging Method with Pixel and Patch Level
DegAE: A New Pretraining Paradigm for Low-Level Vision
CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input
Blind Video Deflickering by Neural Filtering with a Flawed Atlas
Efficient and Explicit Modelling of Image Hierarchies for Image Restoration
Learning Distortion Invariant Representation for Image Restoration from a Causality Perspective
Human Guided Ground-truth Generation for Realistic Image Super-Resolution
Raw Image Reconstruction with Learned Compact Metadata
Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal
N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution
Real-time 6K Image Rescaling with Rate-Distortion Optimization
GamutMLP: A Lightweight MLP for Color Loss Recovery
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion
Quality-Aware Pre-trained Models for Blind Image Quality Assessment
Recurrent Homography Estimation using Homography-guided Image Warping and Focus Transformer
Learning Spatial-Temporal Implicit Neural Representations for Event-guided Video Super-Resolution
RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors
Generating Aligned Pseudo-Supervision from Non-Aligned Data for Image Restoration in Under-Display Camera
Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising
Rethinking Optical Flow from Geometric Matching Consistent Perspective
Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior
Perception-Oriented Single Image Super-Resolution using Optimal Objective Estimation
Zero-Shot Dual-Lens Super-Resolution
Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring
A Simple Baseline for Video Restoration with Grouped Spatial-Temporal Shift
Learning Generative Structure Prior for Blind Text Image Super-Resolution
Motion Information Propagation for Neural Video Compression
Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time
Event-based Video Frame Interpolation with Cross-Modal Asymmetric Bidirectional Motion Fields
Learning Sample Relationship for Exposure Correction
Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising
Context-Aware Pretraining for Efficient Blind Image Decomposition
Physics-guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography
AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation
Complexity-guided Slimmable Decoder for Efficient Deep Video Compression
Bitstream-Corrupted JPEG Images are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration
Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising
Learning from Unique Perspectives: User-Aware Saliency Modeling
DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling
ABCD: Arbitrary Bitwise Coefficient for De-Quantization
Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning
Learning Steerable Function for Efficient Image Resampling
Revisiting the Stack-based Inverse Tone Mapping
Generative Diffusion Prior for Unified Image Restoration and Enhancement
LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising
Adaptive Spot-guided Transformer for Consistent Local Feature Matching
SFD2: Semantic-guided Feature Detection and Description
Burstormer: Burst Image Restoration and Enhancement Transformer
DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients
Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement
Structured Sparsity Learning for Efficient Video Super-Resolution
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos
Exploring Discontinuity for Video Frame Interpolation
Neural Video Compression with Diverse Contexts
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution
Context-based Trit-Plane Coding for Progressive Image Compression
All-in-One Image Restoration for Unknown Degradations using Adaptive Discriminative Filters for Specific Degradations
Learning to Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization
Nighttime Smartphone Reflective Flare Removal using Optical Center Symmetry Prior
Enhancing Deformable Local Features by Jointly Learning to Detect and Describe Keypoints
Real-Time Controllable Denoising for Image and Video
Compression-Aware Video Super-Resolution
Spatial-Frequency Mutual Learning for Face Super-Resolution
The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector
Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution
Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer
Data-Driven Feature Tracking for Event Cameras
LVQAC: Lattice Vector Quantization Coupled with Spatially Adaptive Companding for Efficient Learned Image Compression
Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger
Learning to Detect Mirrors from Videos via Dual Correspondences
Robust Unsupervised StyleGAN Image Restoration
Ingredient-oriented Multi-Degradation Learning for Image Restoration
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution
Semi-Supervised Parametric Real-World Image Harmonization
SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing
Robust Single Image Reflection Removal Against Adversarial Attacks
PMatch: Paired Masked Image Modeling for Dense Geometric Matching
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation
Residual Degradation Learning Unfolding Framework with Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging
Visual Recognition-Driven Image Restoration for Multiple Degradation with Intrinsic Semantics Recovery
sRGB Real Noise Synthesizing with Neighboring Correlation-Aware Noise Model
Rethinking Image Super Resolution from Long-Tailed Distribution Learning Perspective
Comprehensive and Delicate: An Efficient Transformer for Image Restoration
Super-Resolution Neural Operator
Neumann Network with Recursive Kernels for Single Image Defocus Deblurring
Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection
Learning Rotation-Equivariant Features for Visual Correspondence
Patch-Craft Self-Supervised Training for Correlated Image Denoising
Metadata-based RAW Reconstruction via Implicit Neural Functions
Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank
Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning
Spectral Bayesian Uncertainty for Image Super-Resolution
DINER: Disorder-Invariant Implicit Neural Representation
NVTC: Nonlinear Vector Transform Coding
HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering
You Do Not Need Additional Priors or Regularizers in Retinex-based Low-light Image Enhancement
Learning a Practical SDR-to-HDRTV Up-Conversion using New Dataset and Degradation Models

Segmentation, Grouping and Shape Analysis

Title	Repo	Paper	Video
Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos
Vision Transformers are Good Mask Auto-Labelers
Visual Recognition by Request
Ultra-High Resolution Segmentation with Ultra-Rich Context: A Novel Benchmark
AttentionShift: Iteratively Estimated Part-based Attention Map for Pointly Supervised Instance Segmentation
MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos
Look Before You Match: Instance Understanding Matters in Video Object Segmentation
SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation
EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation without Scene Supervision
Camouflaged Object Detection with Feature Decomposition and Edge Reconstruction
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
OneFormer: One Transformer to Rule Universal Image Segmentation
Mask-Free Video Instance Segmentation
Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation
InstMove: Instance Motion for Object-Centric Video Segmentation
The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-guided Mask Representation
Edge-Aware Regional Message Passing Controller for Image Forgery Localization
Interactive Segmentation as Gaussian Process Classification
Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation
Adversarially Masking Synthetic to Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation
Generative Semantic Segmentation
Modeling the Distributional Uncertainty for Salient Object Detection Models
Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation
Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation
DynaMask: Dynamic Mask Selection for Instance Segmentation
MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving
Generalizable Local Feature Pre-Training for Deformable Shape Analysis
Understanding and Improving Features Learned in Deep Functional Maps
G-MSM: Unsupervised Multi-Shape Matching with Graph-based Affinity Priors
Continual Semantic Segmentation with Automatic Memory Sample Selection
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation
Object Discovery from Motion-guided Tokens
Efficient Mask Correction for Click-based Interactive Image Segmentation
Balancing Logit Variation for Long-tailed Semantic Segmentation
Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering
BUOL: A Bottom-Up Framework with Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction from a Single Image
ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation
CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes
Hierarchical Dense Correlation Distillation for Few-Shot Segmentation
UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration
FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation
Understanding Imbalanced Semantic Segmentation through Neural Collapse
Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation
PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models
PartDistillation: Learning Parts from Instance Segmentation
Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings
FastInst: A Simple Query-based Model for Real-Time Instance Segmentation
SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation
Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains
Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework
Hunting Sparsity: Density-guided Contrastive Learning for Semi-Supervised Semantic Segmentation
A Generalized Framework for Video Instance Segmentation
SimpSON: Simplifying Photo Cleanup with Single-Click Distracting Object Segmentation Network
Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning
Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching
Ultrahigh Resolution Image/Video Matting with Spatio-Temporal Sparsity
Style Projected Clustering for Domain Generalized Semantic Segmentation
MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds
Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation
Dynamic Focus-Aware Positional Queries for Semantic Segmentation
HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation
Marching-Primitives: Shape Abstraction from Signed Distance Function
Multimodal Industrial Anomaly Detection via Hybrid Fusion
CLIP is also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor
Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching
Interactive Segmentation of Radiance Fields
Boundary-enhanced Co-Training for Weakly Supervised Semantic Segmentation
Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization
Quantum Multi-Model Fitting
Two-Shot Video Object Segmentation
End-to-End Video Matting with Trimap Propagation
ISBNet: A 3D Point Cloud Instance Segmentation Network with Instance-Aware Sampling and Box-Aware Dynamic Convolution
On Calibrating Semantic Segmentation Models: Analyses and an Algorithm
Explicit Visual Prompting for Low-Level Structure Segmentations
Neural Intrinsic Embedding for Non-rigid Point Cloud Matching
Incrementer: Transformer for Class-Incremental Semantic Segmentation with Knowledge Distillation Focusing on Old Class
Camouflaged Instance Segmentation via Explicit De-Camouflaging
Leveraging Hidden Positives for Unsupervised Semantic Segmentation
Rethinking the Correlation in Few-Shot Segmentation: A Buoys View
Sparsely Annotated Semantic Segmentation with Adaptive Gaussian Mixtures
Mask-guided Matting in the Wild
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Conflict-based Cross-View Consistency for Semi-Supervised Semantic Segmentation
Augmentation Matters: A Simple-yet-Effective Approach to Semi-Supervised Semantic Segmentation
Attention-based Point Cloud Edge Sampling
DA Wand: Distortion-Aware Selection using Neural Mesh Parameterization
Extracting Class Activation Maps from Non-Discriminative Features as well
Focused and Collaborative Feedback Integration for Interactive Image Segmentation
Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training with Saliency Prompt
Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly
MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation
Transformer Scale Gate for Semantic Segmentation
PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers
Side Adapter Network for Open-Vocabulary Semantic Segmentation
Test Time Adaptation with Regularized Loss for Weakly Supervised Salient Object Detection
Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers
Reliability in Semantic Segmentation: Are We on the Right Track?
Beyond mAP: Towards Better Evaluation of Instance Segmentation
Heat Diffusion based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation
Tree Instance Segmentation with Temporal Contour Graph
Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation with Exemplars
Omnimatte3D: Associating Objects and their Effects in Unconstrained Monocular Video
Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation
Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation
Improving Robustness of Semantic Segmentation to Motion-Blur using Class-Centric Augmentation
IFSeg: Image-Free Semantic Segmentation via Vision-Language Model
CLIP-S⁴: Language-guided Self-Supervised Semantic Segmentation
Pruning Parameterization with Bi-Level Optimization for Efficient Semantic Segmentation on the Edge

Deep Learning Architectures and Techniques

Title	Repo	Paper	Video
PA&DA: Jointly Sampling PAth and DAta for Consistent NAS
Top-Down Visual Attention from Analysis by Synthesis
CUF: Continuous Upsampling Filters
Curvature-Balanced Feature Manifold Learning for Long-tailed Classification
Neighborhood Attention Transformer
Progressive Random Convolutions for Single Domain Generalization
Domain Expansion of Image Generators
Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization
Boosting Verified Training for Robust Image Classifications via Abstraction
Joint Token Pruning and Squeezing Towards more Aggressive Compression of Vision Transformers
Vision Transformer with Super Token Sampling
PointListNet: Deep Learning on 3D Point Lists
Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks
Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers
Deep Graph Reprogramming
ConvNeXt V2: Co-Designing and Scaling ConvNets with Masked Autoencoders
Frustratingly Easy Regularization on Representation can Boost Deep Reinforcement Learning
Unified Pose Sequence Modeling
RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer
Real-Time Neural Light Field on Mobile Devices
Towards Scalable Neural Representation for Diverse Videos
AutoFocusFormer: Image Segmentation off the Grid
Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation
Deep Learning of Partial Graph Matching via Differentiable Top-K
WIRE: Wavelet Implicit Neural Representations
Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization
Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval
UniHCP: A Unified Model for Human-Centric Perceptions
Trainable Projected Gradient Method for Robust Fine-Tuning
Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data
B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution
Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks
HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint
From Node Interaction to Hop Interaction: New Effective and Scalable Graph Learning Paradigm
Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
On the Pitfall of Mixup for Uncertainty Calibration
Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision
Mod-Squad: Designing Mixtures of Experts as Modular Multi-Task Learners
DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network
PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers
BiFormer: Vision Transformer with Bi-Level Routing Attention
DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection
Hierarchical Neural Memory Network for Low Latency Event Processing
Block Selection Method for using Feature Norm in Out-of-Distribution Detection
NAR-Former: Neural Architecture Representation Learning towards Holistic Attributes Prediction
MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer
VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution
Multi-Agent Automated Machine Learning
Making Vision Transformers Efficient from a Token Sparsification View
Integral Neural Networks
RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation
One-Shot Model for Mixed-Precision Quantization
Learning Dynamic Style Kernels for Artistic Style Transfer
SVGformer: Representation Learning for Continuous Vector Graphics using Transformers
How to Prevent the Continuous Damage of Noises to Model Training?
GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task
Differentiable Architecture Search with Random Features
ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer
FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
Token Turing Machines
Co-training 2ᴸ Submodels for Visual Recognition
HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search
SLACK: Stable Learning of Augmentations with Cold-Start and KL Regularization
MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins
Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations
Detection of Out-of-Distribution Samples using Binary Neuron Activation Patterns
Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing with Non-Learnable Primitives
Superclass Learning with Representation Enhancement
Perception and Semantic Aware Regularization for Sequential Confidence Calibration
DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions
E2PN: Efficient SE(3)-Equivariant Point Network
Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation
Regularization of Polynomial Networks for Image Recognition
Hyperspherical Embedding for Point Cloud Completion
On the Effectiveness of Partial Variance Reduction in Federated Learning with Heterogeneous Data
Independent Component Alignment for Multi-Task Learning
MP-Former: Mask-piloted Transformer for Image Segmentation
SMPConv: Self-Moving Point Representations for Continuous Convolution
MaskCon: Masked Contrastive Learning for Coarse-labelled Dataset
FlexiViT: One Model for All Patch Sizes
GEN: Pushing the Limits of Softmax-based Out-of-Distribution Detection
Zero-Shot Noise2Noise: Efficient Image Denoising without any Data
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
HNeRV: A Hybrid Neural Representation for Videos
Re-Basin via Implicit Sinkhorn Differentiation
Bayesian Posterior Approximation with Stochastic Ensembles
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck
Federated Learning with Data-Agnostic Distribution Fusion

Multi-Modal Learning

Title	Repo	Paper	Video
Pix2Map: Cross-Modal Retrieval for Inferring Street Maps from Images
Audio-Visual Grouping Network for Sound Localization from Mixtures
Learning Semantic Relationship among Instances for Image-Text Matching
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors
ImageBind: One Embedding Space to Bind Them All
Learning to Dub Movies via Hierarchical Prosody Models
OmniMAE: Single Model Masked Pretraining on Images and Videos
CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset
Egocentric Audio-Visual Object Localization
Learning Visual Representations via Language-guided Sampling
Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models
iQuery: Instruments as Queries for Audio-Visual Sound Separation
Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-Shot Learners
Non-Contrastive Learning Meets Language-Image Pre-Training
Highly Confident Local Structure based Consensus Graph Learning for Incomplete Multi-View Clustering
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Teaching Structured Vision & Language Concepts to Vision & Language Models
Data-Free Sketch-based Image Retrieval
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Efficient Multimodal Fusion via Interactive Prompting
Multimodal Prompting with Missing Modalities for Visual Recognition
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce
What Happened 3 Seconds Ago? Inferring the Past with Thermal Imaging
MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning
Multi-Modal Learning with Missing Modality via Shared-Specific Feature Modelling
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects
Position-guided Text Prompt for Vision-Language Pre-Training
Conditional Generation of Audio from Video via Foley Analogies
OSAN: A One-Stage Alignment Network to Unify Multimodal Alignment and Unsupervised Domain Adaptation
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification
EXIF as Language: Learning Cross-Modal Associations between Images and Camera Metadata
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
RONO: Robust Discriminative Learning with Noisy Labels for 2D-3D Cross-Modal Retrieval
CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration
Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence
Learning Emotion Representations from Verbal and Nonverbal Communication
Enhanced Multimodal Representation Learning with Cross-Modal KD
MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models
Multilateral Semantic Relations Modeling for Image Text Retrieval
GeoVLN: Learning Geometry-enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation
Noisy Correspondence Learning with Meta Similarity Correction
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
MaPLe: Multi-Modal Prompt Learning
Fine-grained Image-Text Matching by Cross-Modal Hard Aligning Network
Towards Modality-Agnostic Person Re-Identification with Descriptive Query
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-Training
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model
Egocentric Auditory Attention Localization in Conversations
Improving Zero-Shot Generalization and Robustness of Multi-Modal Models
Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering
BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Referring Image Matting
Leveraging per Image-Token Consistency for Vision-Language Pre-Training
Seeing what You Miss: Vision-Language Pre-Training with Semantic Completion Learning
Sample-Level Multi-View Graph Clustering
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model
Novel-View Acoustic Synthesis
MAGVLT: Masked Generative Vision-and-Language Transformer
Reproducible Scaling Laws for Contrastive Language-Image Learning
PMR: Prototypical Modal Rebalance for Multimodal Learning
Language-guided Music Recommendation for Video via Prompt Analogies
RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training
MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
CLIPPO: Image-and-Language Understanding from Pixels Only
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
Critical Learning Periods for Multisensory Integration in Deep Networks
CLIPPING: Distilling CLIP-based Models with a Student base for Video-Language Retrieval
NUWA-LIP: Language-guided Image Inpainting with Defect-Free VQGAN
WINNER: Weakly-Supervised hIerarchical DecompositioN and aligNment for spatio-tEmporal Video gRounding
Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation

3D from Single Images

Title	Repo	Paper	Video
3D-Aware Multi-Class Image-to-Image Translation with NeRFs
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis
MagicPony: Learning Articulated 3D Animals in the Wild
Seeing a Rose in Five Thousand Ways
FitMe: Deep Photorealistic 3D Morphable Model Avatars
Scalable, Detailed and Mask-Free Universal Photometric Stereo
Spatio-Focal Bidirectional Disparity Estimation from a Dual-Pixel Image
ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency
High-Fidelity Clothed Avatar Reconstruction from a Single Image
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
Behind the Scenes: Density Fields for Single View Reconstruction
Reconstructing Animatable Categories from Videos
RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
Self-Supervised Geometry-Aware Encoder for Style-based 3D GAN Inversion
3D Cinemagraphy from a Single Image
NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object with 360° Views
iDisc: Internal Discretization for Monocular Depth Estimation
HairStep: Transfer Synthetic to Real using Strand and Depth Maps for Single-View 3D Hair Modeling
NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization
Multiview Compressive Coding for 3D Reconstruction
FaceLit: Neural 3D Relightable Faces
Rigidity-Aware Detection for 6D Object Pose Estimation
Shape-Constraint Recurrent Flow for 6D Object Pose Estimation
Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild
Ref-NPR: Reference-based Non-Photorealistic Radiance Fields for Controllable Scene Stylization
DiffPose: Toward More Reliable 3D Pose Estimation
High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization
Semantic Scene Completion with Cleaner Self
Learned Two-Plane Perspective Prior based Image Resampling for Efficient Object Detection
Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors
Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild
Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning
Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
Accidental Light Probes
Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data
DPF: Learning Dense Prediction Fields with Weak Supervision
DIFu: Depth-guided Implicit Function for Clothed Human Reconstruction
OrienterNet: Visual Localization in 2D Public Maps with Neural Matching
Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes
Structured 3D Features for Reconstructing Controllable Avatars
Delving into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling
High-Fidelity 3D Human Digitization from Single 2K Resolution Images
Learning 3D-Aware Image Synthesis with Unknown Pose Distribution
DP-NeRF: Deblurred Neural Radiance Field with Physical Scene Priors
Recovering 3D Hand Mesh Sequence from a Single Blurry Image: A New Dataset and Temporal Unfolding
Visibility Aware Human-Object Interaction Tracking from Single RGB Camera
SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation
Curricular Object Manipulation in LiDAR-based Object Detection
SeSDF: Self-evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction
MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer
Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion
High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition
NeRDi: Single-View NeRF Synthesis with Language-guided Diffusion as General Image Priors
ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion
Self-Positioning Point-based Transformer for Point Cloud Understanding
H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction
A Probabilistic Attention Model with Occlusion-Aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image
Neural Voting Field for Camera-Space 3D Hand Pose Estimation
PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation
Distilling Neural Fields for Real-Time Articulated Shape Reconstruction
Power Bundle Adjustment for Large-Scale 3D Reconstruction
What You Can Reconstruct from a Shadow
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction
Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement
Trap Attention: Monocular Depth Estimation with Manual Traps
Crowd3D: Towards Hundreds of People Reconstruction from a Single Image
PAniC-3D: Stylized Single-View 3D Reconstruction from Portraits of Anime Characters
HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation
A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-the-Wild Images
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks
BITE: Beyond Priors for Improved Three-D Dog Pose Estimation
SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene
Flow Supervision for Deformable NeRF
Single Image Depth Prediction Made Better: A Multivariate Gaussian Take
CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding
CP³: Channel Pruning Plug-In for Point-based Networks
PC²: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Cross-Domain 3D Hand Pose Estimation with Dual Modalities
RealFusion: 360° Reconstruction of Any Object from a Single Image
Sampling is Matter: Point-guided 3D Human Mesh Reconstruction
Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions
BAAM: Monocular 3D Pose and Shape Reconstruction with Bi-Contextual Attention Module and Attention-guided Modeling
Single View Scene Scale Estimation using Scale Field
Learning Articulated Shape with Keypoint Pseudo-Labels from Web Images
Deformable Mesh Transformer for 3D Human Mesh Recovery

Medical and Biological Vision, Cell Microscopy

Title	Repo	Paper	Video
Decoupled Semantic Prototypes Enable Learning from Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains
Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training
Flexible-Cᵐ GAN: Towards Precise 3D Dose Prediction in Radiotherapy
Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation
MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Label-Free Liver Tumor Segmentation
Devil is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization
DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation
SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection
Learning Federated Visual Prompt in Null Space for MRI Reconstruction
Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation
Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding
Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections
Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation
Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding
Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining
KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation
Weakly Supervised Segmentation with Point Annotations for Histopathology Images via Contrast-based Variational Model
Ambiguous Medical Image Segmentation using Diffusion Models
Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction
Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data
GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency
Fair Federated Medical Image Segmentation via Client Contribution Estimation
Histopathology Whole Slide Image Analysis with Heterogeneous Graph Representation Learning
Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses
Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing
RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction
Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-Pixel Images
Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision
Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification
TINC: Tree-Structured Implicit Neural Compression
Topology-guided Multi-Class Cell Context Generation for Digital Pathology
Directional Connectivity-based Segmentation of Medical Images
A Soma Segmentation Benchmark in Full Adult Fly Brain
Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets
DualRel: Semi-Supervised Mitochondria Segmentation from a Prototype Perspective
SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology
Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation
DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting
Interactive and Explainable Region-guided Radiology Report Generation
A Loopback Network for Explainable Microvascular Invasion Classification
Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images
MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition
Neuralizer: General Neuroimage Analysis without Re-Training
Why is the Winner the Best?
Rethinking Few-Shot Medical Segmentation: A Vector Quantization View
PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training
Indescribable Multi-Modal Spatial Evaluator
Multiple Instance Learning via Iterative Self-paced Supervised Contrastive Learning
Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy

Video: Action and Event Understanding

Title	Repo	Paper	Video
Open Set Action Recognition via Multi-Label Evidential Learning
FLAG3D: A 3D Fitness Activity Dataset with Language Instruction
MoLo: Motion-augmented Long-Short Contrastive Learning for Few-Shot Action Recognition
The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction
Use Your Head: Improving Long-Tail Video Recognition
Decomposed Cross-Modal Distillation for RGB-based Temporal Action Detection
Video Test-Time Adaptation for Action Recognition
How Can Objects Help Action Recognition?
Text-Visual Prompting for Efficient 2D Temporal Video Grounding
Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition
Learning Video Representations from Large Language Models
Fine-tuned CLIP Models are Efficient Video Learners
Efficient Movie Scene Detection using State-Space Transformers
AdamsFormer for Spatial Action Localization in the Future
A Light Weight Model for Active Speaker Detection
System-Status-Aware Adaptive Network for Online Streaming Video Understanding
STMixer: A One-Stage Sparse Action Detector
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
Distilling Vision-Language Pre-Training to Collaborate with Weakly-Supervised Temporal Action Localization
Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video
Modeling Video as Stochastic Processes for Fine-grained Video Representation Learning
Re²TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
Learning Discriminative Representations for Skeleton based Action Recognition
Learning Procedure-Aware Video Representation from Instructional Videos and Their Narrations
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception
PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization
Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks
SVFormer: Semi-Supervised Video Transformer for Action Recognition
AutoAD: Movie Description in Context
STMT: A Spatial-Temporal Mesh Transformer for MoCap-based Action Recognition
Boosting Weakly-Supervised Temporal Action Localization with Text Information
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
LOGO: A Long-Form Video Dataset for Group Action Quality Assessment
Search-Map-Search: A Frame Selection Paradigm for Action Recognition
3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition
ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding
Egocentric Video Task Translation
Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning
Proposal-based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization
TriDet: Temporal Action Detection with Relative Boundary Modeling
Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-based Action Recognition
EVAL: Explainable Video Anomaly Localization
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos
Weakly Supervised Temporal Sentence Grounding with Uncertainty-guided Self-Training
Leveraging Temporal Context in Low Representational Power Regimes
PIVOT: Prompting for Video Continual Learning
On the Benefits of 3D Pose and Tracking for Human Action Recognition
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Selective Structured State-Spaces for Long-Form Video Understanding
Frame Flexible Network
ASPnet: Action Segmentation with Shared-Private Representation of Multiple Data Sources
Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Procedure-Aware Pretraining for Instructional Video Understanding
Latency Matters: Real-Time Action Forecasting Transformer
Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping
HierVL: Learning Hierarchical Video-Language Embeddings
Two-Stream Networks for Weakly-Supervised Temporal Action Localization with Semantic-Aware Mechanisms
Hybrid Active Learning via Deep Clustering for Video Action Detection
Prompt-guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features
Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation
Video Event Restoration based on Keyframes for Video Anomaly Detection
Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition
Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
Post-Processing Temporal Action Detection
Relational Space-Time Query in Long-Form Videos
Therbligs in Action: Video Understanding through Motion Primitives
Dual-Path Adaptation from Image to Video Transformers
Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection
Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection
Unbiased Scene Graph Generation in Videos

Autonomous Driving

Title	Repo	Paper	Video
GraVoS: Voxel Selection for 3D Point-Cloud Detection
BEV@DC: Bird's-Eye View Assisted Training for Depth Completion
Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark
PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer
End-to-End Vectorized HD-Map Construction with Piecewise Bezier Curve
MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection
LiDAR2Map: In Defense of LiDAR-based Semantic Map Construction using Online Camera Distillation
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
Planning-oriented Autonomous Driving
Distilling Focal Knowledge from Imperfect Expert for 3D Object Detection
Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection
SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation
Azimuth Super-Resolution for FMCW Radar in Autonomous Driving
V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving
Coaching a Teachable Student
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks
Center Focusing Network for Real-Time LiDAR Panoptic Segmentation
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction
Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency
CXTrack: Improving 3D Point Cloud Tracking with Contextual Information
ReasonNet: End-to-End Driving with Temporal and Global Reasoning
Seeing with Sound: Long-Range Acoustic Beamforming for Multimodal Scene Understanding
LinK: Linear Kernel for LiDAR-based 3D Perception
Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving
Tri-Perspective View for Vision-based 3D Semantic Occupancy Prediction
SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping using Monocular Frontal View Images
BEV-LaneDet: An Efficient 3D Lane Detection based on Virtual Camera via Key-Points
OcTr: Octree-based Transformer for 3D Object Detection
Instant Domain Augmentation for LiDAR Semantic Segmentation
ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries
UniSim: A Neural Closed-Loop Sensor Simulator
Learning Compact Representations for LiDAR Completion and Generation
Towards Unsupervised Object Detection from LiDAR Point Clouds
Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking
Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving
X³KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation
GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds
Neural Map Prior for Autonomous Driving
Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field
Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation with Implicit Neural Representations
Single Domain Generalization for LiDAR Semantic Segmentation
Uncertainty-Aware Vision-based Metric Cross-View Geolocalization
MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation
PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds
Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection
CAPE: Camera View Position Embedding for Multi-View 3D Object Detection
LiDAR-in-the-Loop Hyperparameter Optimization
Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection
FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction
Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving
Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection
SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization
TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving
Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving
Deep Dive into Gradients: Better Optimization for 3D Object Detection with Gradient-corrected IoU Supervision
ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-informed Proposals
BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion
Hidden Gems: 4D Radar Scene Flow Learning using Cross-Modal Supervision
Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss
Query-Centric Trajectory Prediction
Efficient Hierarchical Entropy Model for Learned Point Cloud Compression
Novel Class Discovery for 3D Point Cloud Semantic Segmentation
MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion
FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs

Self-Supervised or Unsupervised Representation Learning

Title	Repo	Video
SimpleNet: A Simple Network for Image Anomaly Detection and Localization
Masked Image Modeling with Local Multi-Scale Reconstruction
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders		➖
ActMAD: Activation Matching to Align Distributions for Test-Time-Training
Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling is All You Need		➖
DLBD: A Self-Supervised Direct-Learned Binary Descriptor		➖
Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration		➖
Masked Motion Encoding for Self-Supervised Video Representation Learning		➖
Stare at what You See: Masked Image Modeling without Reconstruction
Hard Patches Mining for Masked Image Modeling
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale		➖
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training	➖
Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving	➖	➖
Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond	➖	➖
Integrally Pre-trained Transformer Pyramid Networks
Mixed Autoencoder for Self-Supervised Visual Representation Learning	➖
Correlational Image Modeling for Self-Supervised Visual Pre-Training		➖
Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning		➖
Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric		➖
Evolved Part Masking for Self-Supervised Learning		➖
Change-Aware Sampling and Contrastive Learning for Satellite Images	➖	➖
Learning Common Rationale to Improve Self-Supervised Representation for Fine-grained Visual Recognition Problems		➖
DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks		➖
RILS: Masked Visual Reconstruction in Language Semantic Space		➖
Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning
BASiS: Batch Aligned Spectral Embedding Space	➖	➖
Co-Salient Object Detection with Uncertainty-Aware Group Exchange-Masking	➖	➖
Hyperbolic Contrastive Learning for Visual Representations beyond Objects		➖
Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm		➖
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-based Self-Supervised Pre-Training
OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization	➖	➖
TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models		➖
ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation
Non-Contrastive Unsupervised Learning of Physiological Signals from Video
CrOC: Cross-View Online Clustering for Dense Visual Representation Learning
MOVES: Manipulated Objects in Video Enable Segmentation
Self-Supervised Representation Learning for CAD
Movies2Scenes: Using Movie Metadata to Learn Scene Representation
PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos
Texture-guided Saliency Distilling for Unsupervised Salient Object Detection
Multi-Realism Image Compression with a Conditional Generator
Understanding Masked Autoencoders via Hierarchical Latent Variable Models
GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training
Siamese DETR
Generalizable Implicit Neural Representations via Instance Pattern Composers
Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation
OT-Filter: An Optimal Transport Filter for Learning with Noisy Labels
Teacher-generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
Spatio-Temporal Self-Supervised Learning for Point Clouds in the Wild
BKinD-3D: Self-Supervised 3D Keypoint Discovery from Multi-View Videos
Learning Decorrelated Representations Efficiently using Fast Fourier Transform
Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks
Learning Geometry-Aware Representations by Sketching
Improving Visual Representation Learning through Perceptual Understanding
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Unsupervised Object Localization: Observing the Background to Discover Objects
MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation
DivClust: Controlling Diversity in Deep Clustering
On Data Scaling in Masked Image Modeling
Revealing the Dark Secrets of Masked Image Modeling
Open-Set Representation Learning through Combinatorial Embedding
Coreset Sampling from Open-Set for Fine-grained Self-Supervised Learning
ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling
MetaViewer: Towards a Unified Multi-View Representation
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Understanding Masked Image Modeling via Learning Occlusion Invariant Feature
CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning
Regularize Implicit Neural Representation by Itself

Datasets and Evaluation

Title	Repo	Paper	Video
Large-Scale Training Data Search for Object Re-Identification
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-grained Educational Videos
V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting
NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation
CLOTH4D: A Dataset for Clothed Human Reconstruction
Accelerating Dataset Distillation via Model Augmentation
ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing
Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves
Infinite Photorealistic Worlds using Procedural Generation
CelebV-Text: A Large-Scale Facial Text-Video Dataset
Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo
Connecting Vision and Language with Video Localized Narratives
Towards Artistic Image Aesthetics Assessment: A Large-scale Dataset and a New Method
MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos
Toward RAW Object Detection: A New Benchmark and A New Model
Objaverse: A Universe of Annotated 3D Objects
Habitat-Matterport 3D Semantics Dataset
Similarity Metric Learning for RGB-Infrared Group Re-Identification
MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence
WeatherStream: Light Transport Automation of Single Image Deweathering
MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices
GeoNet: Benchmarking Unsupervised Adaptation Across Geographies
Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning
PACO: Parts and Attributes of Common Objects
Understanding Deep Generative Models with Generalized Empirical Likelihoods
BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge
A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation
An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
BiasBed – Rigorous Texture Bias Evaluation
A Large-Scale Homography Benchmark
Exploring and Utilizing Pattern Imbalance
Full or Weak Annotations? An Adaptive Strategy for Budget-constrained Annotation Campaigns
ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects
Open-Vocabulary Attribute Detection
Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective
An Image Quality Assessment Dataset for Portraits
Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction
3D-POP - An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds with Marker-based Motion Capture
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation
Visual Localization using Imperfect 3D Models from the Internet
Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts
StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments
MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding
A Large-Scale Robustness Analysis of Video Action Recognition Models
Affection: Learning Affective Explanations for Real-World Visual Data
ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations
Deep Depth Estimation from Thermal Image
DF-Platter: Multi-Face Heterogeneous Deepfake Dataset
A New Dataset based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories
RealImpact: A Dataset of Impact Sound Fields for Real Objects
NICO⁺⁺: Towards Better Benchmarking for Domain Generalization

Scene Analysis and Understanding

Title	Repo	Paper	Video
You Only Segment Once: Towards Real-Time Panoptic Segmentation
IS-GGT: Iterative Scene Graph Generation with Generative Transformers
Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness
Panoptic Video Scene Graph Generation
3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud
JacobiNeRF: NeRF Shaping with Mutual Information Gradients
Learning Geometric-Aware Properties in 2D Representation using Lightweight CAD Models, or Zero Real 3D Pairs
Learning and Aggregating Lane Graphs for Urban Automated Driving
MIME: Human-Aware 3D Scene Generation
Connecting the Dots: Floorplan Reconstruction using Two-Level Queries
NeRF-RPN: A General Framework for Object Detection in NeRFs
Relational Context Learning for Human-Object Interaction Detection
Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion
Token Contrast for Weakly-Supervised Semantic Segmentation
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation
CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline
Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting
Indiscernible Object Counting in Underwater Scenes
Long Range Pooling for 3D Large-Scale Scene Understanding
Delivering Arbitrary-Modal Semantic Segmentation
Images Speak in Images: A Generalist Painter for In-Context Visual Learning
SCPNet: Semantic Scene Completion on Point Cloud
Content-Aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers
OpenScene: 3D Scene Understanding with Open Vocabularies
Devil's on the Edges: Selective Quad Attention for Scene Graph Generation
Delving into Shape-Aware Zero-Shot Semantic Segmentation
Category Query Learning for Human-Object Interaction Classification
Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision
DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction
SCOOP: Self-Supervised Correspondence and Optimization-based Scene Flow
Incremental 3D Semantic Scene Graph Prediction from RGB Sequences
PanelNet: Understanding 360 Indoor Environment via Panel Representation
Perspective Fields for Single Image Camera Calibration
Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework
Fast Contextual Scene Graph Generation with Unbiased Context Augmentation
Diffusion-based Generation, Optimization, and Planning in 3D Scenes
TopNet: Transformer-based Object Placement Network for Image Compositing
Computational Flash Photography through Intrinsics
Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task using Artificial Neural Networks
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting
LEGO-Net: Learning Regular Rearrangements of Objects in Rooms
Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
Weakly-Supervised Domain Adaptive Semantic Segmentation with Prototypical Contrastive Learning
ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images
Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields
TempSAL - Uncovering Temporal Information for Deep Saliency Prediction
Probabilistic Debiasing of Scene Graphs
Towards Unified Scene Text Spotting based on Sequence Generation
Learning to Generate Language-Supervised and Open-Vocabulary Scene Graph using Pre-trained Visual-Semantic Space
Modular Memorability: Tiered Representations for Video Memorability Prediction
Where We Are and What We're Looking At: Query Based Worldwide Image Geo-Localization using Hierarchies and Scenes
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

Adversarial Attack and Defense

Title	Repo	Paper	Video
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization
Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition
T-SEA: Transfer-based Self-Ensemble Attack on Object Detection
The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training
Trade-Off between Robustness and Accuracy of Vision Transformers
Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling
Proximal Splitting Adversarial Attack for Semantic Segmentation
Feature Separation and Recalibration for Adversarial Robustness
Enhancing the Self-Universality for Transferable Targeted Attacks
Backdoor Defense via Adaptively Splitting Poisoned Dataset
Dynamic Generative Targeted Attacks with Pattern Injection
Exploring the Relationship between Architectural Design and Adversarially Robust Generalization
Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition
Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks
MaLP: Manipulation Localization using a Proactive Scheme
TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets
Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
AGAIN: Adversarial Training with Attribution Span Enlargement and Hybrid Feature Fusion
Backdoor Defense via Deconfounded Representation Learning
Adversarially Robust Neural Architecture Search for Graph Neural Networks
PointCert: Point Cloud Classification with Deterministic Certified Robustness Guarantees
Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations
Physically Adversarial Infrared Patches with Learnable Shapes and Locations
Color Backdoor: A Robust Poisoning Attack in Color Space
Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition
Turning Strengths into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks
Randomized Adversarial Training via Taylor Expansion
Backdoor Cleansing with Unlabeled Data
The Best Defense is a Good Offense: Adversarial Augmentation Against Adversarial Attacks
Ensemble-based Blackbox Attacks on Dense Prediction
Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning
Adversarial Robustness via Random Projection Filters
Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary
Physical-World Optical Adversarial Attacks on 3D Face Recognition
Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation
How to Backdoor Diffusion Models?
The Resource Problem of using Linear Layer Leakage Attack in Federated Learning
Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-based Attacks
Detecting Backdoors in Pre-trained Encoders
Can't Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders
CFA: Class-Wise Calibrated Fair Adversarial Training
Towards Transferable Targeted Adversarial Examples
Hierarchical Fine-grained Image Forgery Detection and Localization
RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation with Natural Prompts
SlowLiDAR: Increasing the Latency of LiDAR-based Detection using Adversarial Examples
Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks
Improving the Transferability of Adversarial Samples by Path-Augmented Method
Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation
StyLess: Boosting the Transferability of Adversarial Examples
Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup
Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization
Jedi: Entropy-based Localization and Removal of Adversarial Patches
Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts
CUDA: Convolution-based Unlearnable Datasets
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression
Generalist: Decoupling Natural and Robust Generalization
The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection
Revisiting Residual Networks for Adversarial Robustness
Detecting Backdoors During the Inference Stage based on Corruption Robustness Consistency
Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets

Efficient and Scalable Vision

Title	Repo	Video
DisWOT: Student Architecture Search for Distillation WithOut Training		➖
Stitchable Neural Networks
NIRVANA: Neural Implicit Representations of Videos with Adaptive Networks and Autoregressive Patch-Wise Modeling	➖
ResFormer: Scaling ViTs with Multi-Resolution Training
PD-Quant: Post-Training Quantization based on Prediction Difference Metric
DepGraph: Towards Any Structural Pruning		➖
Towards Professional Level Crowd Annotation of Expert Domain Data	➖
GENIE: Show Me the Data for Quantization		➖
Boost Vision Transformer with GPU-Friendly Sparsity and Quantization	➖
MobileOne: An Improved One Millisecond Mobile Backbone		➖
1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions	➖	➖
Discriminator-Cooperated Feature Map Distillation for GAN Compression		➖
EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection
Slimmable Dataset Condensation	➖
Dynamic Inference with Grounding based Vision and Language Models	➖
ScaleDet: A Scalable Multi-Dataset Object Detector	➖
Learning to Zoom and Unzoom
Generic-to-Specific Distillation of Masked Autoencoders
Post-Training Quantization on Diffusion Models		➖
Global Vision Transformer Pruning with Hessian-Aware Saliency	➖
Network Expansion for Practical Training Acceleration		➖
Compacting Binary Neural Networks by Sparse Kernel Selection		➖
PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection
Practical Network Acceleration with Tiny Sets		➖
Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis		➖
Fast Point Cloud Generation with Straight Flows	➖	➖
Rethinking Federated Learning with Domain Shift: A Prototype View
Solving Oscillation Problem in Post-Training Quantization through a Theoretical Perspective		➖
ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector	➖	➖
Adaptive Channel Sparsity for Federated Learning under System Heterogeneity	➖
A-la-carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting	➖
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers	➖
NIPQ: Noise Proxy-based Integrated Pseudo-Quantization		➖
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer		➖
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer		➖
Efficient On-Device Training via Gradient Filtering		➖
Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting		➖
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Adaptive Data-Free Quantization		➖
Train-Once-for-All Personalization	➖	➖
Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models		➖
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
FFCV: Accelerating Training by Removing Data Bottlenecks		➖
Samples with Low Loss Curvature Improve Data Efficiency
Decentralized Learning with Multi-Headed Distillation	➖	➖
Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization	➖	➖
Masked Autoencoders Enable Efficient Knowledge Distillers

Computational Imaging

Title	Repo	Paper	Video
Polarimetric iToF: Measuring High-Fidelity Depth through Scattering Media
All-in-Focus Imaging from Event Focal Stack
Learning Event Guided High Dynamic Range Video Reconstruction
Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking
Efficient View Synthesis and 3D-based Multi-Frame Denoising with Multiplane Feature Representations
Occlusion-Free Scene Recovery via Neural Radiance Fields
Image Super-Resolution using T-Tetromino Pixels
Event-based Blurry Frame Interpolation under Blind Exposure
Decoupling-and-Aggregating for Image Exposure Correction
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
The Differentiable Lens: Compound Lens Search over Glass Surfaces and Materials for Object Detection
Megahertz Light Steering without Moving Parts
Text2Scene: Text-Driven Indoor Scene Stylization with Part-Aware Details
RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images with Diverse Sizes and Imbalanced Categories
Guided Depth Super-Resolution by Deep Anisotropic Diffusion
K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring
Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments
Low-Light Image Enhancement via Structure Modeling and Guidance
Analyzing Physical Impacts using Transient Surface Wave Imaging
DC²: Dual-Camera Defocus Control by Learning to Refocus
pCON: Polarimetric Coordinate Networks for Neural Scene Representations
Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset
NLOST: Non-Line-of-Sight Imaging with Transformer
1000 FPS HDR Video with a Spike-RGB Hybrid Camera
Thermal Spread Functions (TSF): Physics-guided Material Classification
Structured Kernel Estimation for Photon-Limited Deconvolution
EfficientSCI: Densely Connected Network with Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging
EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction
Tunable Convolutions with Parametric Multi-Loss Optimization
Non-Line-of-Sight Imaging with Signal Superresolution Network
Few-Shot Non-Line-of-Sight Imaging with Signal-Surface Collaborative Regularization
"Seeing" Electric Network Frequency from Events
Realistic Saliency Guided Image Enhancement
Learned Image Compression with Mixed Transformer-CNN Architectures
Self-Supervised Blind Motion Deblurring with Deep Expectation Maximization
Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
Range-Nullspace Video Frame Interpolation with Focalized Motion Estimation
Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation
Document Image Shadow Removal Guided by Color-Aware Background
Kernel Aware Resampler
Polarized Color Image Denoising
Constructing Deep Spiking Neural Networks from Artificial Neural Networks with Knowledge Distillation
Role of Transients in Two-Bounce Non-Line-of-Sight Imaging
Inverting the Imaging Process by Learning an Implicit Camera Model
Deep Polarization Reconstruction with PDAVIS Events
A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance
Energy-Efficient Adaptive 3D Sensing
HDR Imaging with Spatially Varying Signal-to-Noise Ratios
Swept-Angle Synthetic Wavelength Interferometry
Passive Micron-Scale Time-of-Flight with Sunlight Interferometry
Implicit View-Time Interpolation of Stereo Videos using Multi-Plane Disparities and Non-Uniform Coordinates
Learning a Deep Color Difference Metric for Photographic Images

Video: Low-Level Analysis, Motion, and Tracking

Title	Repo	Paper	Video
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction
Tracking Multiple Deformable Objects in Egocentric Videos
Tracking through Containers and Occluders in the Wild
TarViS: A Unified Approach for Target-based Video Segmentation
VideoTrack: Learning to Track Objects via Video Transformer
ARKitTrack: A New Diverse Dataset for Tracking using Mobile RGB-D Data
A Dynamic Multi-Scale Voxel Flow Network for Video Prediction
Representation Learning for Visual Object Tracking by Masked Appearance Transfer
EqMotion: Equivariant Multi-Agent Motion Prediction with Invariant Interaction Reasoning
Semi-Supervised Video Inpainting with Cycle Consistency Constraints
Generalized Relation Modeling for Transformer Tracking
Breaking the "Object" in Video Object Segmentation
Unifying Short and Long-Term Tracking with Graph Hierarchies
Simple Cues Lead to a Strong Multi-Object Tracker
Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation
MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors
SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
Joint Visual Grounding and Tracking with Natural Language Specification
Boosting Video Object Segmentation via Space-Time Correspondence Learning
Visual Prompt Multi-Modal Tracking
OVTrack: Open-Vocabulary Multiple Object Tracking
TransFlow: Transformer as Flow Learner
Focus on Details: Online Multi-Object Tracking with Diverse Fine-grained Representation
Autoregressive Visual Tracking
Bootstrapping Objectness from Videos by Relaxed Common Fate and Visual Grouping
Tangentially Elongated Gaussian Belief Propagation for Event-based Incremental Optical Flow Estimation
Bridging Search Region Interaction with Template for RGB-T Tracking
Efficient RGB-T Tracking via Cross-Modality Distillation
MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking
Self-Supervised AutoFlow
UTM: A Unified Multiple Object Tracking Model with Identity-Aware Feature Enhancement
BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation
Spatial-then-Temporal Self-Supervised Learning for Video Correspondence
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation
Context-Aware Relative Object Queries to Unify Video Instance and Panoptic Segmentation
Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions
Resource-Efficient RGBD Aerial Tracking
MMVC: Learned Multi-Mode Video Compression with Block-based Prediction Mode Selection and Density-Adaptive Entropy Coding
Streaming Video Model
Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving
LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection
DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling
SCOTCH and SODA: A Transformer Video Shadow Detection Framework
ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection
Frame-Event Alignment and Fusion Network for High Frame Rate Tracking

Vision Applications and Systems

Title	Repo	Paper	Video
Context De-confounded Emotion Recognition
Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models
Automatic High Resolution Wire Segmentation and Removal
Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism
DIP: Dual Incongruity Perceiving Network for Sarcasm Detection
Adaptive Human Matting for Dynamic Videos
LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction
Prototypical Residual Networks for Anomaly Detection and Localization
Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-based Active Learning
Affordance Grounding from Demonstration Video to Target Image
Natural Language-Assisted Sign Language Recognition
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies
Open-Set Fine-grained Retrieval via Prompting Vision-Language Evaluator
Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking
Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving
Exploiting Unlabelled Photos for Stronger Fine-grained SBIR
What Can Human Sketches Do for Object Detection?
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery
Balanced Energy Regularization Loss for Out-of-Distribution Detection
Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
CLIP for All Things Zero-Shot Sketch-based Image Retrieval, Fine-grained or Not
PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout
Re-Thinking Federated Active Learning based on Inter-Class Diversity
Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection
Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World
Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection
AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration
Multiclass Confidence and Localization Calibration for Object Detection
Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence
Deep Random Projector: Accelerated Deep Image Prior
SIEDOB: Semantic Image Editing by Disentangling Object and Background

Vision and Graphics

Title	Repo	Paper	Video
NeUDF: Leaning Neural Unsigned Distance Fields with Volume Rendering
RaBit: Parametric Modeling of 3D Biped Cartoon Characters with a Topological-Consistent Dataset
DualVector: Unsupervised Vector Font Synthesis with Dual-Part Representation
Magic3D: High-Resolution Text-to-3D Content Creation
Pointersect: Neural Rendering with Cloud-Ray Intersection
Humans as Light Bulbs: 3D Human Reconstruction from Thermal Reflection
ABLE-NeRF: Attention-based Rendering with Learnable Embeddings for Neural Radiance Field
JAWS: Just A Wild Shot for Cinematic Transfer in Neural Radiance Fields
LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
LightPainter: Interactive Portrait Relighting with Freehand Scribble
RODIN: A Generative Model for Sculpting 3D Digital Avatars using Diffusion
NerVE: Neural Volumetric Edges for Parametric Curve Extraction from Point Cloud
CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis
VecFontSDF: Learning to Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions
Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process
Latent-NeRF for Shape-guided Generation of 3D Shapes and Textures
Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching between Parts and Words
Multiplicative Fourier Level of Detail
SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations
Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer
Plateau-reduced Differentiable Path Tracing
3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions
Differentiable Shadow Mapping for Efficient Inverse Graphics
Inverse Rendering of Translucent Objects using Physical and Neural Renderers
MAIR: Multi-View Attention Inverse Rendering with 3D Spatially-Varying Lighting Estimation
Neural Fourier Filter Bank
UMat: Uncertainty-Aware Single Image High Resolution Material Capture
Neural Congealing: Aligning Images to a Joint Semantic Atlas
PlenVDB: Memory Efficient VDB-based Radiance Fields for Fast Training and Rendering
VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation
Learning to Render Novel Views from Wide-Baseline Stereo Pairs
CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language

Robotics

Title	Repo	Paper	Video
Object-Goal Visual Navigation via Effective Exploration of Relations among Historical Navigation States
TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation
Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation using Scene Object Spectrum Grounding
Learning Human-to-Robot Handovers from Point Clouds
Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation from Image Sequence
PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations
DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects
PyPose: A Library for Robot Learning with Physics-based Optimization
Target-Referenced Reactive Grasping for Dynamic Objects
Autonomous Manipulation Learning for Similar Deformable Objects via only One Demonstration
Renderable Neural Radiance Map for Visual Navigation
Efficient Map Sparsification based on 2D and 3D Discretized Grids
Policy Adaptation from Foundation Model Feedback
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer
Affordances from Human Videos as a Versatile Representation for Robotics
DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization
GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds
Neural Volumetric Memory for Visual Locomotion Control
Multi-Object Manipulation via Object-Centric Neural Scattering Functions
Local-guided Global: Paired Similarity Representation for Visual Reinforcement Learning
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion
Imitation Learning as State Matching via Differentiable Physics

Transparency, Fairness, Accountability, Privacy, Ethics in Vision

Title	Repo	Paper	Video
Effective Ambiguity Attack Against Passport-based DNN Intellectual Property Protection Schemes through Fully Connected Layer Substitution
Progressive Open Space Expansion for Open-Set Model Attribution
Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack
DartBlur: Privacy Preservation with Detection Artifact Suppression
Reinforcement Learning-based Black-Box Model Inversion Attacks
Model-Agnostic Gender Debiased Image Captioning
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
AltFreezing for more General Video Face Forgery Detection
Make Landscape Flatter in Differentially Private Federated Learning
DynaFed: Tackling Client Data Heterogeneity with Global Dynamics
Re-Thinking Model Inversion Attacks Against Deep Neural Networks
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
TrojViT: Trojan Insertion in Vision Transformers
Difficulty-based Sampling for Debiased Contrastive Representation Learning
Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection
Fair Scratch Tickets: Finding Fair Sparse Networks without Weight Training
CLIP2Protect: Protecting Facial Privacy using Text-guided Makeup via Adversarial Latent Search
Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures
Learning to Generate Image Embeddings with User-Level Differential Privacy
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
CaPriDe Learning: Confidential and Private Decentralized Learning based on Encryption-Friendly Distillation Loss
DeAR: Debiasing Vision-Language Models with Additive Residuals
Deep Deterministic Uncertainty: A New Simple Baseline
Manipulating Transfer Learning for Property Inference
Training Debiased Subnetworks with Contrastive Weight Pruning
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
STDLens: Model Hijacking-Resilient Federated Learning for Object Detection
Architectural Backdoors in Neural Networks
MEDIC: Remove Model Backdoors via Importance Driven Cloning
Learning Debiased Representations via Conditional Attribute Interpolation

Explainable Computer Vision

Title	Repo	Video
Are Data-Driven Explanations Robust Against Out-of-Distribution Data?
Uncertainty-Aware Unsupervised Image Deblurring with Deep Residual Prior
Teaching Matters: Investigating the Role of Supervision in Vision Transformers		➖
Adversarial Counterfactual Visual Explanations
SketchXAI: A First Look at Explainability for Human Sketches
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Overlooked Factors in Concept-based Explanations: Dataset Choice, Concept Learnability, and Human Capability
Initialization Noise in Image Gradients and Saliency Maps
Learning Bottleneck Concepts in Image Classification
Zero-Shot Model Diagnosis
OCTET: Object-Aware Counterfactual Explanations		➖
X-Pruner: eXplainable Pruning for Vision Transformers
Don't Lie to Me! Robust and Efficient Explainability with Verified Perturbation Analysis	➖	➖
CRAFT: Concept Recursive Activation FacTorization for Explainability		➖
Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space	➖	➖
Explaining Image Classifiers with Multiscale Directional Image Representation		➖
IDGI: A Framework to Eliminate Explanation Noise from Integrated Gradients
Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification
Gradient-based Uncertainty Attribution for Explainable Bayesian Deep Learning	➖
PIP-Net: Patch-based Intuitive Prototypes for Interpretable Image Classification
Shortcomings of Top-Down Randomization-based Sanity Checks for Evaluations of Deep Neural Network Explanations	➖
Spatial-Temporal Concept based Explanation of 3D ConvNets
A Practical Upper Bound for the Worst-Case Attribution Deviations	➖	➖
Adversarial Normalization: I Can Visualize Everything (ICE)

Embodied Vision: Active Agents, Simulation

Title	Repo	Paper	Video
Open-World Multi-Task Control through Goal-Aware Representation Learning and Adaptive Horizon Prediction
Layout-based Causal Inference for Object Navigation
EC²: Emergent Communication for Embodied Control
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts
Phone2Proc: Bringing Robust Robots into Our Chaotic World
PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav
CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification
Modality-Invariant Visual Odometry for Embodied Vision
UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy
EXCALIBUR: Encouraging and Evaluating Embodied Exploration
Leverage Interactive Affinity for Affordance Learning
LANA: A Language-Capable Navigator for Instruction Following and Generation
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second

Document Analysis and Understanding

Title	Repo	Paper	Video
Towards Flexible Multi-Modal Document Models
Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling
Unifying Layout Generation with a Decoupled Diffusion Model
Conditional Text Image Generation with Diffusion Models
Turning a CLIP Model into a Scene Text Detector
Unifying Vision, Text, and Layout for Universal Document Processing
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild
GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction
Handwritten Text Generation from Visual Archetypes
Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution
M⁶Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis
Disentangling Writer and Character Styles for Handwriting Generation

Machine Learning (other than Deep Learning)

Title	Repo	Paper	Video
Deep Incomplete Multi-View Clustering with Cross-View Partial Sample and Prototype Alignment
Towards Better Decision Forests: Forest Alternating Optimization
Class Adaptive Network Calibration
Defining and Quantifying the Emergence of Sparse Concepts in DNNs
MOT: Masked Optimal Transport for Partial Domain Adaptation
Adaptive Graph Convolutional Subspace Clustering
Reliable and Interpretable Personalized Federated Learning
Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization
Efficient Verification of Neural Networks Against LVM-based Specifications
You Are Catching My Attention: Are Vision Transformers Bad Learners under Backdoor Attacks?
Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels
Sliced Optimal Partial Transport
A Meta-Learning Approach to Predicting Performance and Data Requirements
Towards Effective Visual Representations for Partial-Label Learning

Physics-based Vision and Shape-from-X

Title	Repo	Paper	Video
Learning Anchor Transformations for 3D Garment Animation
High-Fidelity Event-Radiance Recovery via Transient Event Frequency
Complementary Intrinsics from Neural Radiance Fields and CNNs for Outdoor Scene Relighting
Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection
Event-based Shape from Polarization
Weakly-Supervised Single-View Image Relighting
DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering
Learning Accurate 3D Shape based on Stereo Polarimetric Imaging
Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark
Light Source Separation and Intrinsic Image Decomposition under AC Illumination
OReX: Object Reconstruction from Planar Cross-Sections using Neural Fields
Unsupervised Intrinsic Image Decomposition with LiDAR Intensity

Biometrics

Title	Repo	Video
Instance-Aware Domain Generalization for Face Anti-Spoofing
OpenGait: Revisiting Gait Recognition Toward Better Practicality		➖
Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation	➖
GaitGCI: Generative Counterfactual Intervention for Gait Recognition	➖
Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment
AstroNet: When Astrocyte Meets Artificial Neural Network	➖	➖
DCFace: Synthetic Face Generation with Dual Condition Diffusion Model
LidarGait: Benchmarking 3D Gait Recognition with Point Clouds		➖
CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability
Dual-Bridging with Adversarial Noise Generation for Domain Adaptive rPPG Estimation	➖
Evading Forensic Classifiers with Attribute-Conditioned Adversarial Faces

Optimization Methods (other than Deep Learning)

Title	Repo	Paper	Video
Pose Synchronization under Multiple Pair-Wise Relative Poses
Adaptive Global Decay Process for Event Cameras
Wide-Angle Rectification via Content-Aware Conformal Mapping
On the Convergence of IRLS and its Variants in Outlier-Robust Estimation
A General Regret Bound of Preconditioned Gradient Method for DNN Training
Robust and Scalable Gaussian Process Regression and its Applications
EMT-NAS: Transferring Architectural Knowledge between Tasks from Different Datasets
Transformer-based Learned Optimization
Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition
Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions
Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization
Elastic Aggregation for Federated Optimization

Photogrammetry and Remote Sensing

Title	Repo	Paper	Video
MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection
Probability-based Global Cross-Modal Upsampling for Pansharpening
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares
Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection
ViTs for SITS: Vision Transformers for Satellite Image Time Series
Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification
TopDiG: Class-Agnostic Topological Directional Graph Extraction from Remote Sensing Images
OmniCity: Omnipotent City Understanding with Multi-Level and Multi-View Images

Computer Vision Theory

Title	Repo	Paper	Video
Neural Dependencies Emerging from Learning Massive Categories
Gaussian Label Distribution Learning for Spherical Image Object Detection
Unbalanced Optimal Transport: A Unified Framework for Object Detection
DropKey for Vision Transformer
SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

Computer Vision for Social Good

Title	Repo	Video
Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples		➖
On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-grained Content-Rich Patches Transfer	➖
SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy	➖
TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization		➖
Angelic Patches for Improving Third-Party Object Detector Performance		➖

Others

Title	Repo	Paper	Video
A Bag-of-Prototypes Representation for Dataset-Level Applications
Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation
Label Information Bottleneck for Label Enhancement
DISC: Learning from Noisy Labels via Dynamic Instance-Specific Selection and Correction
Restoration of Hand-Drawn Architectural Drawings using Latent Space Mapping with Degradation Generator
DaFKD: Domain-Aware Federated Knowledge Distillation
Enhanced Stable View Synthesis
ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients
GradMA: A Gradient-Memory-based Accelerated Federated Learning with Alleviated Catastrophic Forgetting
High-Resolution Image Reconstruction with Latent Diffusion Models from Human Brain Activity
A Unified Knowledge Distillation Framework for Deep Directed Graphical Models
How to Prevent the Poor Performance Clients for Personalized Federated Learning?
