Commit 627c516

chore: update confs
committed · 1 parent e47936c · commit 627c516

File tree: 1 file changed (+28 -0 lines)

arxiv.json (+28)
@@ -29272,5 +29272,33 @@
     "pub_date": "2025-03-27",
     "summary": "Four-dimensional computed tomography (4D CT) reconstruction is crucial for\ncapturing dynamic anatomical changes but faces inherent limitations from\nconventional phase-binning workflows. Current methods discretize temporal\nresolution into fixed phases with respiratory gating devices, introducing\nmotion misalignment and restricting clinical practicality. In this paper, We\npropose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT\nreconstruction by integrating dynamic radiative Gaussian splatting with\nself-supervised respiratory motion learning. Our approach models anatomical\ndynamics through a spatiotemporal encoder-decoder architecture that predicts\ntime-varying Gaussian deformations, eliminating phase discretization. To remove\ndependency on external gating devices, we introduce a physiology-driven\nperiodic consistency loss that learns patient-specific breathing cycles\ndirectly from projections via differentiable optimization. Extensive\nexperiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR\ngain over traditional methods and 2.25 dB improvement against prior Gaussian\nsplatting techniques. By unifying continuous motion modeling with hardware-free\nperiod learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for\ndynamic clinical imaging. Project website at: https://x2-gaussian.github.io/.",
     "translated": ""
+  },
+  {
+    "title": "Q-Insight: Understanding Image Quality via Visual Reinforcement Learning",
+    "url": "http://arxiv.org/abs/2503.22679v1",
+    "pub_date": "2025-03-28",
+    "summary": "Image quality assessment (IQA) focuses on the perceptual visual quality of\nimages, playing a crucial role in downstream tasks such as image\nreconstruction, compression, and generation. The rapid advancement of\nmulti-modal large language models (MLLMs) has significantly broadened the scope\nof IQA, moving toward comprehensive image quality understanding that\nincorporates content analysis, degradation perception, and comparison reasoning\nbeyond mere numerical scoring. Previous MLLM-based methods typically either\ngenerate numerical scores lacking interpretability or heavily rely on\nsupervised fine-tuning (SFT) using large-scale annotated datasets to provide\ndescriptive assessments, limiting their flexibility and applicability. In this\npaper, we propose Q-Insight, a reinforcement learning-based model built upon\ngroup relative policy optimization (GRPO), which demonstrates strong visual\nreasoning capability for image quality understanding while requiring only a\nlimited amount of rating scores and degradation labels. By jointly optimizing\nscore regression and degradation perception tasks with carefully designed\nreward functions, our approach effectively exploits their mutual benefits for\nenhanced performance. Extensive experiments demonstrate that Q-Insight\nsubstantially outperforms existing state-of-the-art methods in both score\nregression and degradation perception tasks, while exhibiting impressive\nzero-shot generalization to comparison reasoning tasks. Code will be available\nat https://github.com/lwq20020127/Q-Insight.",
+    "translated": ""
+  },
+  {
+    "title": "DSO: Aligning 3D Generators with Simulation Feedback for Physical\n Soundness",
+    "url": "http://arxiv.org/abs/2503.22677v1",
+    "pub_date": "2025-03-28",
+    "summary": "Most 3D object generators focus on aesthetic quality, often neglecting\nphysical constraints necessary in applications. One such constraint is that the\n3D object should be self-supporting, i.e., remains balanced under gravity.\nPrior approaches to generating stable 3D objects used differentiable physics\nsimulators to optimize geometry at test-time, which is slow, unstable, and\nprone to local optima. Inspired by the literature on aligning generative models\nto external feedback, we propose Direct Simulation Optimization (DSO), a\nframework to use the feedback from a (non-differentiable) simulator to increase\nthe likelihood that the 3D generator outputs stable 3D objects directly. We\nconstruct a dataset of 3D objects labeled with a stability score obtained from\nthe physics simulator. We can then fine-tune the 3D generator using the\nstability score as the alignment metric, via direct preference optimization\n(DPO) or direct reward optimization (DRO), a novel objective, which we\nintroduce, to align diffusion models without requiring pairwise preferences.\nOur experiments show that the fine-tuned feed-forward generator, using either\nDPO or DRO objective, is much faster and more likely to produce stable objects\nthan test-time optimization. Notably, the DSO framework works even without any\nground-truth 3D objects for training, allowing the 3D generator to self-improve\nby automatically collecting simulation feedback on its own outputs.",
+    "translated": ""
+  },
+  {
+    "title": "TranSplat: Lighting-Consistent Cross-Scene Object Transfer with 3D\n Gaussian Splatting",
+    "url": "http://arxiv.org/abs/2503.22676v1",
+    "pub_date": "2025-03-28",
+    "summary": "We present TranSplat, a 3D scene rendering algorithm that enables realistic\ncross-scene object transfer (from a source to a target scene) based on the\nGaussian Splatting framework. Our approach addresses two critical challenges:\n(1) precise 3D object extraction from the source scene, and (2) faithful\nrelighting of the transferred object in the target scene without explicit\nmaterial property estimation. TranSplat fits a splatting model to the source\nscene, using 2D object masks to drive fine-grained 3D segmentation. Following\nuser-guided insertion of the object into the target scene, along with automatic\nrefinement of position and orientation, TranSplat derives per-Gaussian radiance\ntransfer functions via spherical harmonic analysis to adapt the object's\nappearance to match the target scene's lighting environment. This relighting\nstrategy does not require explicitly estimating physical scene properties such\nas BRDFs. Evaluated on several synthetic and real-world scenes and objects,\nTranSplat yields excellent 3D object extractions and relighting performance\ncompared to recent baseline methods and visually convincing cross-scene object\ntransfers. We conclude by discussing the limitations of the approach.",
+    "translated": ""
+  },
+  {
+    "title": "Understanding Co-speech Gestures in-the-wild",
+    "url": "http://arxiv.org/abs/2503.22668v1",
+    "pub_date": "2025-03-28",
+    "summary": "Co-speech gestures play a vital role in non-verbal communication. In this\npaper, we introduce a new framework for co-speech gesture understanding in the\nwild. Specifically, we propose three new tasks and benchmarks to evaluate a\nmodel's capability to comprehend gesture-text-speech associations: (i)\ngesture-based retrieval, (ii) gestured word spotting, and (iii) active speaker\ndetection using gestures. We present a new approach that learns a tri-modal\nspeech-text-video-gesture representation to solve these tasks. By leveraging a\ncombination of global phrase contrastive loss and local gesture-word coupling\nloss, we demonstrate that a strong gesture representation can be learned in a\nweakly supervised manner from videos in the wild. Our learned representations\noutperform previous methods, including large vision-language models (VLMs),\nacross all three tasks. Further analysis reveals that speech and text\nmodalities capture distinct gesture-related signals, underscoring the\nadvantages of learning a shared tri-modal embedding space. The dataset, model,\nand code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal",
+    "translated": ""
   }
 ]

0 commit comments