Conferences
https://doi.org/10.1145/3746058.3758397
Short-form videos offer advantages of rapid information acquisition and intuitive content consumption. However, the combination of brevity and infinite scroll structures exacerbates habitual overuse problems. Existing intervention studies have failed to achieve fundamental behavioral change by relying on simple usage limits and one-time notifications, taking a uniform approach that does not consider individual viewing contexts. To overcome these limitations, this study proposes SHIFT, which applies If-Then theory. SHIFT guides users to establish specific plans of “if situation X occurs, then I will do action Y” and tracks actual execution to induce automatic behavioral change. The system collects real-time scroll patterns, automatically classifies viewing content using Vision-Language Models, and generates personalized intervention messages through four distinct LLM-based prompting strategies. A 4-week user study confirmed a 50% reduction in average daily usage time (p < 0.001) and achieved an 83% intervention success rate (p < 0.001).
If-Then Planning, Short-form Video, Self-regulation, Personal Informatics, Large Language Model, Human-Computer Interaction
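The if-then mechanism described above lends itself to a simple data structure. A minimal sketch of how plans and their execution tracking might be represented; the class and field names are illustrative, not taken from SHIFT:

```python
from dataclasses import dataclass

@dataclass
class IfThenPlan:
    # "If situation X occurs, then I will do action Y."
    situation: str
    action: str
    attempts: int = 0
    successes: int = 0

    def record(self, executed: bool) -> None:
        # Log one trigger of the situation and whether the planned action was taken.
        self.attempts += 1
        if executed:
            self.successes += 1

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0

plan = IfThenPlan(situation="I scroll short-form videos for 10 minutes",
                  action="close the app and stretch")
for executed in [True, True, False, True, True, True]:
    plan.record(executed)
print(round(plan.success_rate, 2))  # 0.83
```

Aggregating `success_rate` across a user's plans would yield the kind of intervention success metric the study reports.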
https://doi.org/10.1145/3714394.3756347
Journaling is widely recognized for promoting stress reduction, emotional resilience, and goal setting. With the rise of Large Language Models (LLMs), digital journaling systems are now capable of providing adaptive and personalized support. However, many existing solutions fail to consider users' evolving emotional states and motivational goals. We present POCKET-MIND, an LLM-based journaling application powered by GPT-4. The system utilizes a dual-prompt design to facilitate both emotional exploration and goal-oriented reflection. In a one-week pilot study with 15 participants (ages 19–34), over 80% reported improvements in emotional awareness and goal progress. Notably, users showed increased journaling consistency and meaningful progress from the third day onward. Our findings suggest that LLM-based journaling systems can offer effective, personalized mental health support. This work contributes to the growing field of digital mental health by demonstrating the role of conversational AI in promoting psychological well-being and user engagement.
Large Language Models, Digital Journaling, Mental Health, Goal Reflection, Emotion-aware Interaction, Human-AI Collaboration
https://doi.org/10.1145/3714394.3756337
Existing counseling chatbots typically determine flow transitions based on explicit user profiling or emotion recognition. However, in real-world mental health contexts, users often struggle to accurately recognize or verbalize their internal states, making per-turn inference both costly and unreliable. We propose a selective modeling strategy that monitors deviations in the chatbot's own persona (so-called persona drift) as an indirect signal of user state change. When such deviation is detected by the heuristic module, signaling a subtle misalignment between the user's implicit state and the chatbot's consistent persona, the user state is reassessed and the counseling flow is conditionally adjusted. To reflect actual therapeutic processes, we implement a modular counseling system grounded in the Transtheoretical Model of Change (TTM), with five chatbots tailored to the user's behavioral readiness stage. Together, the TTM-based architecture and the persona-drift module form a lightweight yet adaptive framework for tracking user state and guiding conversation flow in mental health chatbots.
mental health chatbot; user state modeling; persona consistency
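A sketch of one plausible form of the persona-drift heuristic, assuming persona and responses are compared as embedding vectors; the similarity threshold and the toy vectors are illustrative assumptions, not values from the paper:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def drifted_turns(persona_vec, response_vecs, threshold=0.7):
    # Flag turns whose response embedding strays from the persona reference:
    # low similarity is read as persona drift and would trigger reassessment
    # of the user state and a conditional flow adjustment.
    return [i for i, v in enumerate(response_vecs) if cosine(persona_vec, v) < threshold]

persona = [1.0, 0.0, 0.0]            # reference embedding of the chatbot's persona
responses = [[0.9, 0.1, 0.0],        # consistent turn
             [0.1, 1.0, 0.2]]        # drifted turn
print(drifted_turns(persona, responses))  # → [1]
```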
https://doi.org/10.1145/3714394.3756158
Breathing is a common entry point to mindfulness, but for beginners, simply observing the breath can feel vague and hard to follow. We present MindfulBreath, an interactive system that uses real-time respiratory data to recommend a personalized breathing pace and visualize each inhale and exhale as flowing movement through a human figure. Designed as a preparatory experience, MindfulBreath helps users rehearse attentional control and embodied focus before beginning body scan meditation. In a study with 15 participants, initial findings suggest that MindfulBreath may improve focus, relaxation, and engagement over passive music or standard breathing apps. These preliminary results indicate that visually embodied breathing exercises have the potential to serve as an effective bridge to deeper mindfulness. Unlike conventional mindfulness apps and devices, which offer fixed timing or generic animations, MindfulBreath adapts in real time to each user's unique respiratory pattern, positioning breath not only as a target of observation, but as an accessible medium for embodied mental training.
Mindfulness; Breathing exercise; Biofeedback; Respiratory sensing; Interactive system; Human-centered computing
Table data are essential in multiple fields, particularly the financial domain, for tasks such as financial statement analysis. While large language models (LLMs) have advanced text-based research, they struggle with image-format data. Recent multimodal Large Language Models (multimodal LLMs) have demonstrated an ability to process text and images, but their table image processing performance remains limited and lacks domain specificity. Here, we introduce FinTab-LLaVA, a multimodal LLM designed for effective financial table processing. FinTab-LLaVA is instruction-tuned on FinTMD, a financial domain-specific dataset comprising table images and textual data, supporting tasks such as finance table question answering (FTQA), finance table fact verification (FTFV), and finance table description (FTD). Domain knowledge training can enhance its mathematical reasoning and financial expertise. By adopting a curriculum learning approach, FinTab-LLaVA extends Table-LLaVA to handle the unique requirements of financial table data. Experiment results show that FinTab-LLaVA outperforms existing models in financial table-based tasks and demonstrates strong generalization capabilities. Our findings emphasize the potential of domain-specific multimodal LLMs in processing financial data and significantly expands LLM applications in the financial sector. The code and data are available at https://github.com/Emilia0608/FinTab-LLaVA.
https://doi.org/10.1145/3714394.3756161
Increased information overload and cognitive fatigue have made it increasingly difficult for individuals to sustain attention and regulate emotions in everyday life. While mindfulness practice offers a proven approach to restoring attentional focus and emotional balance, its effectiveness can be undermined by frequent mind wandering. In this paper, we thus introduce a real-time robotic mindfulness intervention system that integrates a multi-biometric mind-wandering detector with the social robot Pepper. The system continuously monitors user state by analyzing biometric indicators and provides timely, context-aware verbal and non-verbal interventions generated via a GPT-based model. Our system estimates the mind-wandering (MW) level in real-time, and if the MW level exceeds a threshold, Pepper generates personalized feedback using off-the-shelf ChatGPT. In our experimental study with 45 participants, the robot-assisted group achieved higher Mindful Attention Awareness Scale (MAAS) scores and lower MW levels than both the no-intervention and audio-only groups. MAAS and MW showed a strong negative correlation (𝑟 = −0.78, 𝑝 < .001). These results demonstrate that the robot’s embodied presence and multimodal feedback effectively support attention recovery and enhance mindfulness engagement.
Mind-Wandering Detection, Multimodal Biometrics, Social Robot Intervention, Mindfulness, LLM-Based Adaptive Feedback
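The threshold-triggered intervention logic can be sketched as a weighted fusion of normalized biometric indicators; the weights, indicator choices, and threshold below are illustrative assumptions, not the paper's actual detector:

```python
def mw_level(indicators, weights):
    # Weighted fusion of normalized biometric indicators into a 0-1
    # mind-wandering score (a stand-in for the multi-biometric detector).
    return sum(w * x for w, x in zip(weights, indicators)) / sum(weights)

def should_intervene(level, threshold=0.6):
    # When the MW level exceeds the threshold, the robot would deliver feedback.
    return level > threshold

weights = [0.4, 0.3, 0.3]            # hypothetical weights, e.g. EEG, EDA, gaze features
calm    = mw_level([0.2, 0.1, 0.3], weights)
drifted = mw_level([0.8, 0.7, 0.9], weights)
print(should_intervene(calm), should_intervene(drifted))  # False True
```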
Text-to-SQL is the task of translating natural language queries into structured database queries. However, evaluating Text-to-SQL systems requires assessing not just accuracy but also robustness to input variations. Current robustness evaluation methods depend on human annotators, which leads to subjective bias, high costs, and a lack of quantitative measures. To address these issues, we propose Stepwise Perturbation with In-Context Prompting, a method that systematically applies interpretable unit operations for perturbation. Our experiments demonstrate the framework's effectiveness in disrupting model responses across various scenarios. This research offers a quantitative and unbiased assessment of Text-to-SQL model robustness and presents the potential for application in creating benchmark datasets for evaluating Text-to-SQL model robustness.
Text-to-SQL, Robustness, Stepwise Perturbation, In-Context Prompting
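A minimal sketch of stepwise perturbation with interpretable unit operations, applied here to a natural-language query; the specific operations and synonym table are illustrative, not the paper's actual operation set:

```python
def synonym_swap(q, table):
    # Unit operation: replace words via a small synonym table.
    return " ".join(table.get(w, w) for w in q.split())

def drop_word(q, word):
    # Unit operation: remove a token the model may rely on.
    return " ".join(w for w in q.split() if w != word)

def stepwise_perturb(query, ops):
    # Apply interpretable unit operations one at a time, keeping each
    # intermediate so robustness can be measured per step.
    steps = [query]
    for op in ops:
        steps.append(op(steps[-1]))
    return steps

query = "list the names of employees hired after 2020"
ops = [lambda q: synonym_swap(q, {"list": "show", "employees": "staff"}),
       lambda q: drop_word(q, "the")]
for s in stepwise_perturb(query, ops):
    print(s)
```

Feeding each intermediate to a Text-to-SQL model and checking where its output breaks gives a quantitative, step-resolved robustness measure.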
Optimizing code performance remains a core challenge in software development. In this paper, we introduce a Python PIE dataset of 844K slow-fast pairs labeled with Code Efficiency Scores (1-10) and frame score prediction as a 10-way classification task. We fine-tune encoder-only pretrained models such as CodeBERT, GraphCodeBERT, and UniXcoder with a lightweight classification layer on our Python PIE dataset, achieving 93.9% classification accuracy on the validation set. We then embed this classifier in an Optimization Agent that uses GPT-3.5-turbo on HumanEval problems, automatically refining low-scoring solutions to reduce average runtime by 21% and boost Pass@1 from 73.2% to 84.1% versus GPT-3.5-turbo alone. Our proposed pipeline can be integrated into Python-based ML environments to support end-to-end code performance evaluation and optimization.
Large Language Models, Code Generation, Code Efficiency, Software Development
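The classify-then-refine loop of such an optimization agent can be sketched abstractly; here the fine-tuned classifier and the LLM call are replaced by stub dictionaries, so only the control flow follows the description above:

```python
def optimization_agent(code, score_fn, refine_fn, target=7, max_rounds=3):
    # Score the solution; while it scores below the target, ask the
    # refiner (in the paper, an LLM) for a faster version and re-score.
    history = [(code, score_fn(code))]
    for _ in range(max_rounds):
        if history[-1][1] >= target:
            break
        code = refine_fn(code)
        history.append((code, score_fn(code)))
    return history

# Stub scorer and refiner standing in for the efficiency classifier and GPT call.
scores = {"slow": 3, "faster": 6, "fast": 9}
refine = {"slow": "faster", "faster": "fast"}
history = optimization_agent("slow", scores.get, refine.get)
print(history[-1])  # ('fast', 9)
```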
Recent advances in Multimodal Large Language Models (MLLMs) have led to impressive performance across diverse vision-language tasks. However, these models exhibit a tendency to generate responses based more on linguistic priors, which refer to generalized knowledge learned during pretraining, rather than grounding their outputs in the actual visual input. This often results in responses that are linguistically plausible but misaligned with the visual content, leading to hallucinations or factual inaccuracies. This issue becomes especially apparent when the model is presented with semantically contradictory scenes that defy common sense or stereotypical expectations. To investigate their tendency to overlook visual inputs, we propose a novel benchmark composed of 60 synthetic images and 180 QA pairs, intentionally designed to inject semantic contradictions between visual and linguistic modalities. We conduct evaluations across a range of state-of-the-art MLLMs, and find a consistent tendency to overlook visual information when it conflicts with textual content.
Multimodal Large Language Models, Hallucination, Commonsense Conflict, Visual Question Answering
While recent advances in large language models (LLMs) have substantially enhanced productivity across diverse domains, their application to programming knowledge tracing in programming education remains underexplored. In this paper, we introduce a novel evaluation benchmark for Programming Knowledge Tracing (PKT) and Programming Problem Recommendation (PPR) that embodies our three core contributions. We first reconstruct a reliable dataset by cleaning, normalizing, and rebalancing problem logs, code submissions, and correctness labels. Next, we define the PPR task, designing a unified schema that uses session-based learning histories, code submissions, and problem metadata to generate real-time, personalized next-problem suggestions. Finally, we expand the dataset into English with concept-tag annotations to support evaluation. Using this benchmark, we systematically compare representative deep learning–based KT models (e.g., DKT, PDKT) and an LLM baseline on both Knowledge Tracing (KT) accuracy and next-problem recommendation.
Large Language Models, Programming Education, Problem Recommendation, Knowledge Tracing
https://doi.org/10.1145/3746252.376090
We propose Spectral Edge Encoding (SEE), a parameter-free framework that quantifies each edge's contribution to the global structure by measuring spectral shifts in the Laplacian eigenvalues. SEE captures the low-frequency sensitivity of edges and integrates these scores into graph Transformer attention logits as a structure-aware bias. When applied to the Moiré Graph Transformer (MoiréGT) and evaluated on seven MoleculeNet classification benchmarks, SEE consistently improves ROC-AUC performance. In particular, MoiréGT+SEE achieves an average ROC-AUC of 85.3%, approximately 7.1 percentage points higher than the previous state-of-the-art model UniCorn (78.2%). Moreover, SEE preserves molecular topology and enables edge-level interpretability, offering a practical alternative to sequence-based chemical language models. These results demonstrate that spectrum-informed attention can simultaneously enhance performance and transparency in graph-based molecular modeling.
Graph Transformer, Spectral Edge Encoding, Edge Sensitivity, Edge Importance, Molecular Property Prediction
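The core of SEE, scoring an edge by how much its removal shifts the low-frequency Laplacian spectrum, can be sketched directly; the toy graph, the number of retained eigenvalues `k`, and the absolute-shift aggregation are illustrative choices:

```python
import numpy as np

def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

def edge_scores(adj, k=2):
    # Score each edge by the shift in the k smallest Laplacian eigenvalues
    # when that edge is removed (low-frequency sensitivity). These scores
    # would then be injected into attention logits as a structure-aware bias.
    base = np.sort(np.linalg.eigvalsh(laplacian(adj)))[:k]
    scores = {}
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            if adj[i, j]:
                pert = adj.copy()
                pert[i, j] = pert[j, i] = 0
                eig = np.sort(np.linalg.eigvalsh(laplacian(pert)))[:k]
                scores[(i, j)] = float(np.abs(base - eig).sum())
    return scores

# A path 0-1-2-3 plus a chord (1,3): the bridge (0,1) disconnects the graph
# when removed, so it should receive the largest spectral-shift score.
adj = np.zeros((4, 4))
for e in [(0, 1), (1, 2), (2, 3), (1, 3)]:
    adj[e] = adj[e[::-1]] = 1
s = edge_scores(adj)
print(max(s, key=s.get))  # (0, 1)
```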
https://doi.org/10.1145/3746252.376087
Graph Transformers (GTs) excel at long-range reasoning on graphs but often rely on costly positional encodings or auxiliary virtual nodes to perceive geometry. We present the RadialFocus Graph Transformer (RadialFocus), a geometry-aware GT that learns to modulate attention with a lightweight, distance-selective kernel. Each head is equipped with a differentiable radial basis function whose centre μ and width σ are trained end-to-end, boosting attention between nodes that lie inside its adaptive "focus" while gently suppressing others. Injecting the logarithm of this kernel into the pre-softmax logits preserves the stability and permutation invariance of standard self-attention, incurs negligible memory overhead, and removes the need for hand-crafted 3-D encodings or virtual nodes. On 3-D molecular benchmarks RadialFocus attains a validation MAE of 46.3 meV on PCQM4Mv2 with only 13M parameters, surpassing models an order of magnitude larger. It also sets a new best average ROC-AUC (79.1%) on MoleculeNet and reaches 0.957 MAE on PDBBind2020, a new high-water mark for binding-affinity prediction. The same architecture transfers to 2-D graphs, achieving 97.8% accuracy on MNIST-Superpixel. Ablation studies indicate that the learned (μ, σ) capture task-relevant distance scales and that log-space fusion stabilises gradients. These findings suggest that a simple, learned distance modulation suffices to equip Transformers with strong geometric priors, enabling accurate and parameter-efficient reasoning across diverse graph domains.
Graph Transformer, Radial Basis Function, Geometric Deep Learning
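The log-kernel injection into pre-softmax logits can be sketched in a few lines; the values of μ and σ, the toy distances, and the uniform raw logits below are illustrative, not learned parameters from the paper:

```python
import numpy as np

def rbf_log_bias(dists, mu, sigma):
    # Logarithm of a Gaussian radial kernel: zero at distance mu,
    # increasingly negative as pairs leave the head's "focus".
    return -((dists - mu) ** 2) / (2 * sigma ** 2)

def focused_attention(logits, dists, mu=1.5, sigma=0.5):
    # Add the log-kernel to pre-softmax logits; the softmax then boosts
    # pairs near the focus distance and gently suppresses the rest.
    z = logits + rbf_log_bias(dists, mu, sigma)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.zeros((1, 3))            # uniform raw attention from one query node
dists = np.array([[0.2, 1.5, 4.0]])  # distances to three neighbours
attn = focused_attention(logits, dists)
print(attn.round(3))                 # mass concentrates on the neighbour near mu
```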
Knowledge distillation (KD) is a powerful technique for model compression, enabling the creation of compact and efficient "student" models by transferring knowledge from large-scale, pre-trained "teacher" models. However, applying traditional KD methods to single-image super-resolution (SISR) is considerably more challenging than in high-level tasks like classification, as SISR reconstructs image pixels and is therefore a regression problem. Hence, to effectively distill the knowledge of a teacher model in SISR, we propose MCAD-KD, Multi-Scale Contrastive-Adversarial Distillation for Super-Resolution. We utilize a novel hybrid contrastive learning framework that operates on both global (image-level) and local (patch-level) scales. Furthermore, we integrate adversarial guidance, which pushes the student's output towards the manifold of realistic images, allowing it to potentially surpass the perceptual quality of the teacher by learning directly from the ground-truth data distribution. Our comprehensive framework synergistically combines these components to train a lightweight student model that achieves a superior trade-off between perceptual quality and computational efficiency.
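The contrastive component of such a distillation scheme is typically an InfoNCE-style loss; this sketch assumes cosine-similarity features and a single negative, which are illustrative simplifications rather than MCAD-KD's actual loss:

```python
import numpy as np

def info_nce(student, positive, negatives, tau=0.1):
    # Contrastive distillation: pull the student feature toward the
    # teacher's feature (positive) and away from features of other
    # images or patches (negatives).
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(student, positive)] +
                      [cos(student, n) for n in negatives]) / tau
    logits -= logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(p[0]))  # low when student matches the teacher

s = np.array([1.0, 0.1])
t = np.array([1.0, 0.0])          # aligned teacher feature -> small loss
n = [np.array([0.0, 1.0])]        # mismatched feature from another patch
aligned = info_nce(s, t, n)
misaligned = info_nce(s, n[0], [t])
print(aligned < misaligned)  # True
```

Computing this loss at both image level and patch level would give the global/local two-scale structure the abstract describes.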
https://doi.org/10.1145/3787256.3787266
A major challenge in the financial domain is the class imbalance problem, which can negatively impact model performance. Although oversampling methods such as SMOTE and ADASYN are commonly used to address this issue, they are often sensitive to outliers, which limits their effectiveness. To overcome this limitation, we propose an improved oversampling approach that integrates advanced outlier detection methods—including Isolation Forest variants and autoencoders—into the ADASYN framework. Experimental results show that our method achieves higher prediction accuracy and more stable performance, making it well-suited for real-world financial applications involving imbalanced datasets.
Anomaly Detection, Oversampling, SMOTE, ADASYN, Isolation Forest
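The filter-then-oversample idea can be sketched with stand-ins: a simple distance-based outlier filter in place of Isolation Forest or an autoencoder, and SMOTE-style interpolation in place of full ADASYN's density-weighted synthesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def filter_outliers(X, z=2.0):
    # Stand-in for an advanced detector: drop minority points whose
    # distance to the minority centroid is an extreme value.
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return X[d < d.mean() + z * d.std()]

def oversample(X_min, n_new):
    # SMOTE/ADASYN-style synthesis: interpolate between pairs of minority
    # samples; filtering first keeps outliers out of the interpolation.
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.vstack([rng.normal(0, 1, (20, 2)), [[50.0, 50.0]]])  # one gross outlier
clean = filter_outliers(X_min)
synth = oversample(clean, 10)
print(len(clean), synth.shape)  # 20 (10, 2)
```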
https://doi.org/10.1145/3787256.3787264
This study presents a lifecycle-based clustering approach to analyze weekly sales data, aiming to enhance sales prediction accuracy for small-scale retailers. By combining time series clustering and deep learning methods, the proposed framework classifies sales patterns into four distinct lifecycle stages and predicts future cluster memberships. Using franchise store card and delivery sales data from an anonymous startup’s app, we demonstrate that integrating Self-Organizing Maps (SOM) with TimeSeriesKMeans (TSKMeans) improves clustering accuracy compared to using TSKMeans alone. Additionally, the Temporal Transformer Model (TTM) is applied to predict cluster classifications, achieving a validation accuracy of up to 89% over a 24-week period. The results underscore the effectiveness of lifecycle-based clustering for enhancing sales forecasting in dynamic and uncertain market environments, offering valuable insights for small retailers looking to optimize their sales strategies.
Clustering, Cluster prediction, TimeSeriesKmeans, SOM, Transformer
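A minimal sketch of the clustering side of this pipeline: a tiny 1-D self-organizing map over weekly sales vectors; the unit count, learning schedule, and toy lifecycle shapes are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def train_som(data, n_units=4, epochs=50, lr=0.5, seed=0):
    # Minimal 1-D self-organizing map: each sample pulls its best-matching
    # unit (and, more weakly, that unit's neighbours) toward itself.
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for t in range(epochs):
        eta = lr * (1 - t / epochs)          # decaying learning rate
        for x in data:
            bmu = int(np.argmin(np.linalg.norm(w - x, axis=1)))
            for u in range(n_units):
                h = np.exp(-abs(u - bmu))    # neighbourhood pull on the 1-D grid
                w[u] += eta * h * (x - w[u])
    return w

def assign(data, w):
    return np.array([int(np.argmin(np.linalg.norm(w - x, axis=1))) for x in data])

# Two toy lifecycle shapes: rising vs. declining weekly sales over 8 weeks.
rng = np.random.default_rng(1)
rising = np.linspace(0, 1, 8) + rng.normal(0, 0.05, (10, 8))
falling = np.linspace(1, 0, 8) + rng.normal(0, 0.05, (10, 8))
data = np.vstack([rising, falling])
w = train_som(data)
labels = assign(data, w)
print(labels)
```

In the paper's pipeline, SOM assignments like these seed TimeSeriesKMeans, and a downstream model then predicts future cluster membership.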
This paper proposes a no-reference quality assessment framework that can reliably evaluate rendered views by jointly considering cross-view consistency and geometric stability, even when reference images are unavailable or pose information is incomplete. Existing reference-based metrics (e.g., PSNR, SSIM, LPIPS) require accurate reference views, limiting their applicability in real production and inspection pipelines, and they fail to adequately reflect viewpoint misalignment or structural inconsistency in their scores. Our approach increases structural sensitivity by combining a pretrained Spatial Feature Extractor with depth cues obtained from a single-image depth prediction transformer (DPT). Evaluation was conducted on 360-degree and forward-facing (FF) datasets; after joint training on both, performance is reported only on scenes held out from training in each domain, verifying in-domain generalization. Correlation coefficients (SRCC/PLCC) improved consistently across diverse capture conditions (indoor/outdoor, lighting changes, texture diversity) and multiple 3D Gaussian rendering models, and the method remained robust to distribution shifts between domains. Overall, the proposed method enables reliable, structure-aware quality estimation without reference images.
https://doi.org/10.48550/arXiv.2506.03171
EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.
https://doi.org/10.1109/WACV61041.2025.00867
Precise retina Optical Coherence Tomography (OCT) image classification and segmentation are important for diagnosing various retinal diseases and identifying specific regions. Alongside comprehensive lesion identification, reducing the predictive uncertainty of models is crucial for improving reliability in clinical retinal practice. However, existing methods have primarily focused on a limited set of regions identified in OCT images and have often faced challenges due to aleatoric and epistemic uncertainty. To address these issues, we propose CAMEL (Confidence-Aware Multi-task Ensemble Learning), a novel framework designed to reduce task-specific uncertainty in multi-task learning. CAMEL achieves this by estimating model confidence at both pixel and image levels and leveraging confidence-aware ensemble learning to minimize the uncertainty inherent in single-model predictions. CAMEL demonstrates state-of-the-art performance on a comprehensive retinal OCT image dataset containing annotations for nine distinct retinal regions and nine retinal diseases. Furthermore, extensive experiments highlight the clinical utility of CAMEL, especially in scenarios with minimal regions, significant class imbalances, and diverse regions and diseases. Our code is publicly available at: https://github.com/DSAIL-SKKU/CAMEL.
The proliferation of deepfake content on social media threatens information integrity. The role of deepfake media's unique visual characteristics in propagation has been largely overlooked by prior research. This paper proposes a proactive predictive model that forecasts a deepfake image's total shares over a 15-day period using its visual features and early propagation data. We constructed a unique deepfake propagation dataset and analyzed how image attributes correlate with virality. Our model, which integrates GraphSAGE with an LSTM network and CNN-extracted visual features, achieves superior predictive accuracy. The results show our approach can effectively identify deepfake content with high viral potential, contributing to the early prediction and mitigation of deepfake dissemination.
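The graph side of such a model can be sketched as a single GraphSAGE-style mean-aggregation layer over a toy share cascade; the features, weights, and cascade below are illustrative, and the LSTM and CNN components are omitted:

```python
import numpy as np

def sage_layer(H, adj, W_self, W_neigh):
    # GraphSAGE mean aggregation: each node combines its own features with
    # the mean of its neighbours' features, then applies a ReLU.
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = np.where(deg > 0, adj @ H / np.maximum(deg, 1), 0.0)
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)

# Toy share cascade: node 0 (the deepfake post) reshared by nodes 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0],   # per-node features, e.g. early share counts
              [0.0, 1.0],
              [0.0, 0.5]])
W = np.eye(2)               # identity weights keep the arithmetic inspectable
out = sage_layer(H, adj, W, W)
print(out.round(2))
```

Stacked layers of this kind would produce per-node embeddings of the propagation graph, which an LSTM could then consume as the cascade unfolds over time.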
This study aims to verify self-assessment accuracy in online depression screening by comparing a conversational chatbot against the conventional questionnaire format. To this end, we built a PHQ-9-based online depression chatbot powered by a large language model and compared its self-assessment accuracy with the conventional online questionnaire on real users. The chatbot adopts a task-oriented dialogue system architecture, maintaining contextual continuity between items so that questions are presented naturally and user responses are elicited. In an experiment with real users, the PHQ-9 total scores produced by the chatbot-based method showed high agreement with conventional PHQ-9 self-assessment results, indicating that the two methods have comparable assessment accuracy. These results provide evidence that LLM-based chatbots can serve as reliable auxiliary tools in clinical settings and contribute to establishing their validity and practicality.
This paper presents a study that combines image data from tuyeres and tapholes with sensor data to analyze and model molten iron temperature prediction at a steel plant. Whereas prior work has relied on sensor data alone, this study uses sensor data together with tuyere and taphole image data to analyze their effect on molten iron temperature prediction. Experimental results show that incorporating the tuyere and taphole image data improved prediction performance, suggesting that this approach can contribute to quality control and efficiency improvement in the steelmaking process.
Cognitive distortion is a phenomenon in which an individual's thinking becomes irrationally biased while interpreting external events, and detecting cognitive distortions in utterances is a key factor in improving the effectiveness of psychotherapy. However, prior detection work has focused mainly on the surface content of utterances and has not sufficiently captured the structural characteristics of distorted thinking. To address this limitation, we propose Signature-of-Thought (SoT), a technique that uses a large language model (LLM) to infer the structural characteristics of cognitive distortions. In SoT, the LLM integrates the typical patterns and cues of various distorted thought processes to define a thought-structure signature, and then judges whether an input utterance exhibits a cognitive distortion by comparing its structural similarity to the signature. Experiments on a public cognitive distortion dataset show that the proposed SoT technique achieves up to a 9% relative improvement in F1-score over existing approaches, suggesting that the signature-based structural approach is effective at improving the accuracy and consistency of LLM-based cognitive distortion detection.
Large language models (LLMs) are rapidly expanding in scope, including attempts to apply them to mental health tasks such as early detection of depressive disorder. However, despite concerns that social biases embedded in training data may be reflected in model judgments and undermine the fairness and reliability of results, gender bias in the mental health domain has not yet been sufficiently examined. Accordingly, this study investigates whether gender information induces bias in LLM-based depression diagnosis and whether existing bias mitigation techniques can control it. Specifically, analyzing diagnostic outcomes for identical symptom descriptions with and without gender cues, we found diagnostic disagreement in about 29.1% of samples. After applying bias mitigation strategies and repeating the diagnosis under the same conditions, diagnoses agreed in 90.1% of all samples. These results demonstrate that non-clinical factors such as gender cues can indeed affect LLM-based depression diagnosis, and suggest that bias mitigation strategies should be considered when using LLMs to obtain fair and reliable results.
https://doi.org/10.1145/3706599.371972
While AI text generation is instantaneous, image generation and reasoning models take longer. To signal processing time to users, platforms tend to use interface cues such as progress bars, skeleton screens, and textual descriptions of the generation process. However, little is known about whether and how these cues affect user experience of generative AI tools. Through semi-structured interviews with 11 participants who interacted with various AI image generation tools, we examined user perceptions of waiting times and progress indicators, which we term “generation cues.” Data show that users are generally accepting of, sometimes even value, waiting times, viewing them as an inherent part of the creative process. However, they rarely notice generation cues. These findings challenge traditional usability principles of speed, efficiency, and feedback, suggesting the need to rethink waiting times and generation cues in the context of generative AI. We offer design recommendations for leveraging waiting times and generation cues to enhance human-AI collaboration in creative domains.
Image generation; Waiting time perception; Generation cues; Human-AI collaboration; User experience design
https://doi.org/10.1145/3706599.371970
Dementia is a global public health concern, with approximately 55 million individuals worldwide currently living with it. Individual cognitive stimulation therapy (iCST) has been shown to improve quality of life for persons living with dementia (PLwDs). However, providing iCST at scale remains a serious challenge, particularly as it can add considerable burden on care partners. This project focuses on a voice assistant (VA) to support care partners in delivering iCST. We developed a VA prototype and conducted a qualitative study with care partners (N=5). Our preliminary findings show that using a VA to deliver iCST is feasible and acceptable. We have also identified design requirements for the VA to effectively provide iCST, including the need for personalization reflecting dementia severity and individual interests, collaboration between care partners and PLwDs, and accessible interactions to minimize frustration and distress. These findings can inform the future design of inclusive and accessible VAs.
Dementia, voice assistant, cognitive stimulation therapy, LLMs
https://doi.org/10.1145/3706599.3719796
Recent advances in generative AI enable the production of photorealistic images that blur the line between authentic and synthetic content, raising concerns about misleading users when misused. We examined how the realism of AI-generated images affects users’ credibility judgments of misinformation and explored two key cognitive mechanisms: realism heuristic and synthetic heuristic. A between-subjects experiment (N = 253) revealed that highly realistic AI-generated images significantly increased the credibility of misinformation compared to less realistic images, and this effect was primarily driven by the realism heuristic, as photorealistic images were perceived as portraying reality. Additionally, photorealistic images tended to inhibit activation of the synthetic heuristic, which we newly proposed in this paper (i.e., the perception that images are artificially created or edited). Contrary to previous findings, our data show that regular users of generative AI and social media are more susceptible to misinformation. We propose identifying and labeling AI involvement in image creation and increasing literacy.
AI-Generated Misinformation, Photorealism, Realism Heuristic, Synthetic Heuristic, Credibility, Prior Experience
https://doi.org/10.1145/3706599.3719749
As generative AI becomes increasingly popular, some platforms have started to offer customization options, allowing users to modify responses to become shorter or longer. This study explores the impact of these customization options on user experience and perceived credibility of the generative AI tool. In an experiment with three conditions (no/voluntary/forced customization), participants (N = 194) interacted with Google Gemini on four health topics. The results showed that customization increased perceived control, which was positively associated with satisfaction with generated responses and source credibility. Perceived control brought about by customizing also had an empowering effect by boosting fact-checking intentions. Our findings offer novel design insights for customization affordances in generative AI platforms, and we provide recommendations on ways to leverage customization to jointly improve both user experience and socially responsible use of AI.
Customization, Satisfaction, Credibility, Fact-checking Intentions
https://doi.org/10.1145/3706599.3720264
Although social media platforms are beginning to implement policies for labeling AI usage in posts, we do not know how this labeling affects user perceptions and which types of labeling source and language would be most effective. What does AI-generated content mean to users? What are users’ perceptions of the content, content creator, and platform that has AI-labeled content? Do the effects differ depending on whether the label is attributed to the user or the platform? A focus group study (N=14) revealed that users appreciate how AI helps to create better content. However, their perceptions of AI-labeled content are shaped by their mental models of how social media algorithms work. Some participants viewed AI-labeled content as more trendy, while others saw it as direct advertisements. Some believed the content to be automatically fake, but their reactions varied depending on the type of content or account. Regarding labeling source, users preferred self-labeling over platform labeling. We discuss theoretical and practical implications for the design of social media interfaces for disclosing AI usage.
AI Labeling, UGC, User-Generated Content
https://doi.org/10.1609/aies.v8i2.36620
The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse real-world applications has amplified the importance of addressing societal biases inherent within these technologies. While the Natural Language Processing (NLP) community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this paper presents findings from a university-level competition that challenged participants to design prompts specifically for eliciting biased outputs from GenAI tools. We conducted a quantitative and qualitative analysis of the submitted prompts and the resulting GenAI outputs. This analysis led to the identification of reproducible biases across eight distinct categories within GenAI systems. Furthermore, we identified and categorized the various strategies employed by participants to successfully induce these biased responses. Our findings provide unique insights into how non-expert users understand, engage with, and attempt to manipulate biases in GenAI tools. This research contributes to a deeper understanding of the user-side experience of AI bias and offers actionable knowledge for developers and policymakers working towards creating fairer and more equitable AI systems.
https://doi.org/10.1145/3708557.3716343
Nonverbal cues are essential for natural communication, yet voice assistants often lack such features, limiting their effectiveness in social contexts like emotion journaling. This study investigates the impact of different visual backchannel feedback patterns on conversational flow and user perception with an icon-based voice assistant. Using a within-subject design, participants experienced three feedback conditions: regular, synchronized, and randomized. Quantitative results showed that minimal differences in visual patterns significantly influenced user perceptions, with synchronized and randomized feedback generally outperforming regular feedback. Qualitative findings emphasized individual differences, highlighting the need for customizable feedback to address diverse user needs. This study contributes to the design of inclusive, user-centered voice assistant interfaces for social applications.
https://doi.org/10.1145/3711112
In this paper, we propose PracticeDAPR, an AI-based education-support system for beginners practicing DAPR assessment. As professional identity is considered a pivotal goal in art therapy education, it is important to help beginners avoid difficulties in developing it. We therefore designed the proposed system to provide the following three factors, which are closely associated with the professional identity formation of beginners in art therapy: (i) performance improvement, (ii) anxiety reduction, and (iii) self-efficacy enhancement. To this end, we adopt online peer-to-peer learning as the foundational learning approach. In addition, by introducing AI as a mentor, we let users not only interact with their peers but also experience AI assistance. A user study targeting graduate students in art therapy was conducted with both quantitative and qualitative methods. In general, users reported positive experiences with PracticeDAPR. Structural equation model analysis showed that perceived usefulness is an important contributor to the three factors, highlighting the effectiveness of online peer-to-peer learning with the AI mentor. Furthermore, the finding that intention to use can be promoted by performance improvement demonstrates that PracticeDAPR can consistently support the development of professional identity, which is not easily established in a short period of time. Discussion and implications are provided in relation to using AI and online peer-to-peer learning to support current art therapy education.
https://doi.org/10.1145/3708359.3712081
Psychological counseling, especially for children, heavily relies on capturing both verbal and non-verbal cues to understand and support each child’s emotional and developmental needs. Therefore, creating a detailed and accurate transcription for a child’s counseling session is crucial but often labor-intensive and time-consuming, which makes it challenging to maintain the consistency of counseling quality. Despite advancements in AI, current session analysis practices rely primarily on manual clinical assessments and struggle to accurately capture children’s verbal and non-verbal expressions. To address these challenges, we propose an AI-based expert support system designed to enhance child counseling analysis. The system comprises two key components: (i) a transcription generation model and (ii) an editable dashboard. The transcription generation model extracts verbal expressions from both children and counselors, verifies speakers’ identities, and objectively captures non-verbal cues using a Multimodal Large Language Model. The editable dashboard facilitates Counselor & AI collaboration, where AI reduces human bias by providing objectivity, and counselors mitigate the risk of over-reliance on AI while maintaining oversight. This collaboration ultimately enhances workflow efficiency and leads to accurate counseling analyses. An evaluation with 48 child counselors demonstrates the system’s superior effectiveness and usability compared to existing services, with a majority expressing a strong intent to continue using our system. The system not only improves transcription accuracy but also supports more precise analysis of counseling sessions, enabling counselors to focus more on therapeutic engagements. These findings highlight the system’s potential to reduce the workload of child counselors, improve the quality of counseling services, and provide valuable resources for both individual counseling and counselor training. To the best of our knowledge, our study is the first to propose an AI-based expert support system optimized for generating transcriptions for child counseling analysis.
This paper presents a study that analyzes and models molten iron temperature prediction at a steel mill by combining tuyere and taphole image data with sensor data. Whereas prior related work relied on sensor data alone, this study uses sensor data together with tuyere and taphole images to analyze their effect on molten iron temperature prediction. Experimental results show that incorporating the tuyere and taphole image data improved prediction performance, suggesting that it can contribute to quality control and efficiency gains in the steelmaking process.
10.1109/WACV61041.2025.00867
Precise retina Optical Coherence Tomography (OCT) image classification and segmentation are important for diagnosing various retinal diseases and identifying specific regions. Alongside comprehensive lesion identification, reducing the predictive uncertainty of models is crucial for improving reliability in clinical retinal practice. However, existing methods have primarily focused on a limited set of regions identified in OCT images and have often faced challenges due to aleatoric and epistemic uncertainty. To address these issues, we propose CAMEL (Confidence-Aware Multi-task Ensemble Learning), a novel framework designed to reduce task-specific uncertainty in multi-task learning. CAMEL achieves this by estimating model confidence at both pixel and image levels and leveraging confidence-aware ensemble learning to minimize the uncertainty inherent in single-model predictions. CAMEL demonstrates state-of-the-art performance on a comprehensive retinal OCT image dataset containing annotations for nine distinct retinal regions and nine retinal diseases. Furthermore, extensive experiments highlight the clinical utility of CAMEL, especially in scenarios with minimal regions, significant class imbalances, and diverse regions and diseases.
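The confidence-aware ensembling idea can be illustrated with a minimal NumPy sketch. This is not CAMEL's actual mechanism (the function name, tensor shapes, and the use of the maximum class probability as confidence are all assumptions for illustration): each model's per-pixel prediction is weighted by its own confidence before averaging.

```python
import numpy as np

def confidence_weighted_ensemble(probs):
    """Combine per-model softmax maps by per-pixel confidence.

    probs: array of shape (n_models, n_classes, H, W) holding each
    model's per-pixel class probabilities.

    Each model's confidence at a pixel is taken as its maximum class
    probability there; the ensemble output is the confidence-weighted
    average of the probability maps, shape (n_classes, H, W).
    """
    conf = probs.max(axis=1, keepdims=True)           # (n_models, 1, H, W)
    weights = conf / conf.sum(axis=0, keepdims=True)  # normalize over models
    return (weights * probs).sum(axis=0)              # (n_classes, H, W)
```

Because the weights sum to one at every pixel, the fused output remains a valid per-pixel probability distribution.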
Considering the limited hardware performance of mobile environments, this study presents a method that extends existing still-image-based object detection and tracking techniques to real-time video processing. To this end, we lightweight YOLOv5 and integrate the DeepSORT algorithm, enabling real-time multi-object tracking on mobile GPU platforms such as the Jetson Nano. We also propose improving inference speed while minimizing performance degradation through model architecture optimization and Knowledge Distillation. To address the object-size reduction and ID-switching problems that can arise in drone footage, we introduce a new tracking strategy based on low-confidence detections. Experiments in diverse settings confirm that the proposed algorithm achieves stable real-time multi-object tracking performance compared with existing YOLO-based models, demonstrating its potential to improve UAV video processing and mobile real-time vision systems.
To mitigate the motion blur and ghosting artifacts that arise in real-time free-viewpoint video reconstruction systems based on 3D Gaussian Splatting, this study proposes SplitStream, a foreground-background separation scheme with a dual NTC (Neural Transformation Cache) training structure. In the initial frame, foreground and background are separated in 3D space using 2D masking and Space Carving, after which an independent NTC is trained for each region. The foreground mainly learns transformations so it can respond flexibly to fast motion, while the background, reflecting its static nature, additionally learns new Gaussians and performs pruning. This dual training structure reduces unnecessary background deformation and motion blur while improving computational efficiency. In experiments on the N3DV dataset, the proposed model outperformed the baseline in PSNR and visual quality, and in dynamic scenes in particular the boundary between foreground and background remained sharp.
This paper uses a Domain-Adversarial Neural Network (DANN) to address the performance degradation that facial expression recognition (FER) models trained on predominantly Western datasets exhibit on East Asian faces. An emotion classifier and a domain classifier are trained jointly, optimized to maximize emotion-classification accuracy while minimizing domain-classification accuracy, so that the network learns race-independent feature representations. The effectiveness of the proposed method was validated on a combined benchmark of the Tsinghua and JAFFE East Asian datasets, confirming a significant improvement in cross-cultural emotion recognition accuracy over single-domain models, with especially large gains on East Asian faces. This work contributes to overcoming data bias in FER and to developing more generalizable models.
10.1109/ICIP55913.2025.11084465
In the United States, as of 2023, pet ownership has reached 66% of households and continues to rise annually. This trend underscores the critical need for effective pet identification and monitoring methods, particularly as nearly 10 million cats and dogs are reported stolen or lost each year. However, traditional methods for finding lost animals, such as GPS tags or ID photos, have limitations: they can be removed, face signal issues, and depend on someone finding and reporting the pet. To address these limitations, we introduce PawPrint and PawPrint+, the first publicly available datasets focused on individual-level footprint identification for dogs and cats. Through comprehensive benchmarking of both modern deep neural networks (e.g., CNNs, Transformers) and classical local features, we observe varying advantages and drawbacks depending on substrate complexity and data availability. These insights suggest future directions for combining learned global representations with local descriptors to enhance reliability across diverse, real-world conditions. As this approach provides a non-invasive alternative to traditional ID tags, we anticipate promising applications in ethical pet management and wildlife conservation efforts.
10.1109/ICIP55913.2025.11084407
This paper addresses the problem of anticipating traffic accidents, which aims to forecast potential accidents before they happen. Real-time anticipation is crucial for safe autonomous driving, yet most methods rely on computationally heavy modules like optical flow and intermediate feature extractors, making real-world deployment challenging. In this paper, we thus introduce RARE (Real-time Accident anticipation with Reused Embeddings), a lightweight framework that capitalizes on intermediate features from a single pre-trained object detector. By eliminating additional feature-extraction pipelines, RARE significantly reduces latency. Furthermore, we introduce a novel Attention Score Ranking Loss, which prioritizes higher attention on accident-related objects over non-relevant ones. This loss enhances both accuracy and interpretability. RARE demonstrates a 4-8× speedup over existing approaches on the DAD and CCD benchmarks, achieving a latency of 13.6ms per frame (73.3FPS) on an RTX 6000. Moreover, despite its reduced complexity, it attains state-of-the-art Average Precision and reliably anticipates imminent collisions in real time. These results highlight RARE’s potential for safety-critical applications where timely and explainable anticipation is essential.
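The Attention Score Ranking Loss is described only at a high level; a pairwise hinge formulation is one plausible reading. The sketch below is an assumption-laden illustration (function name, margin value, and exact form are not from the paper): it penalizes every pair in which an accident-related object's attention score fails to exceed a non-relevant object's score by a margin.

```python
import numpy as np

def attention_ranking_loss(scores, is_accident, margin=0.1):
    """Pairwise hinge ranking loss over object attention scores
    (illustrative sketch, not RARE's exact formulation).

    scores: (n_objects,) attention scores for detected objects.
    is_accident: (n_objects,) boolean mask of accident-related objects.
    """
    pos = scores[is_accident]   # accident-related objects
    neg = scores[~is_accident]  # all other objects
    if pos.size == 0 or neg.size == 0:
        return 0.0
    # hinge on every (pos, neg) pair: max(0, margin - (pos_i - neg_j))
    diff = pos[:, None] - neg[None, :]
    return float(np.maximum(0.0, margin - diff).mean())
```

The loss is zero whenever every accident-related object already out-scores every other object by the margin, and grows as that ordering is violated.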
10.1109/WACV61041.2025.00280
Recent advancements in computer vision have led to a renewed interest in developing assistive technologies for individuals with visual impairments. Although extensive research has been conducted in the field of computer vision-based assistive technologies, most of the focus has been on understanding contexts in images, rather than addressing their physical safety and security concerns. To address this challenge, we propose the first step towards detecting anomalous situations for visually impaired people by observing their entire surroundings using an egocentric 360-degree camera. We first introduce a novel egocentric 360-degree video dataset called VIEW360 (Visually Impaired Equipped with Wearable 360-degree camera), which contains abnormal activities that visually impaired individuals may encounter, such as shoulder surfing and pickpocketing. Furthermore, we propose a new architecture called the FDPN (Frame and Direction Prediction Network), which facilitates frame-level prediction of abnormal events and identification of their directions. Finally, we evaluate our approach on our VIEW360 dataset and the publicly available UCF-Crime and Shanghaitech datasets, demonstrating state-of-the-art performance.
To mitigate cross-task interference in model merging, this work introduces L1-regularization-based sparsity into the learning of merging coefficients and optimizes the coefficients per task.
The proposed technique outperformed existing methods across diverse datasets, demonstrating that sparsity plays an important role in reducing interference and improving the merged model's generalization.
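The core idea of sparsified merging coefficients can be sketched with task arithmetic and soft-thresholding, the proximal operator of the L1 penalty. This is an illustrative stand-in for the learned procedure (function name, coefficient values, and the use of a single proximal step are assumptions): small, interference-prone task contributions are driven exactly to zero before merging.

```python
import numpy as np

def merge_with_sparse_coeffs(base, task_vectors, coeffs, l1=0.05):
    """Merge task vectors into a base model with L1-sparsified
    per-task coefficients (sketch, not the paper's training loop).

    base: flat parameter vector of the pretrained model.
    task_vectors: list of flat vectors (finetuned minus base), one per task.
    coeffs: per-task merging coefficients.
    l1: soft-threshold zeroing out small coefficients.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    # soft-thresholding = proximal step of the L1 penalty
    sparse = np.sign(coeffs) * np.maximum(np.abs(coeffs) - l1, 0.0)
    merged = base + sum(c * tv for c, tv in zip(sparse, task_vectors))
    return merged, sparse
```

A coefficient smaller in magnitude than `l1` contributes nothing to the merged model, which is how sparsity suppresses weak, interfering task directions.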
https://doi.org/10.48550/arXiv.2506.03171
EdgeVidSum is a lightweight method that generates personalized, fast-forward summaries of long-form videos directly on edge devices. The proposed approach enables real-time video summarization while safeguarding user privacy through local data processing using innovative thumbnail-based techniques and efficient neural architectures. Unlike conventional methods that process entire videos frame by frame, the proposed method uses thumbnail containers to significantly reduce computational complexity without sacrificing semantic relevance. The framework employs a hierarchical analysis approach, where a lightweight 2D CNN model identifies user-preferred content from thumbnails and generates timestamps to create fast-forward summaries. Our interactive demo highlights the system's ability to create tailored video summaries for long-form videos, such as movies, sports events, and TV shows, based on individual user preferences. The entire computation occurs seamlessly on resource-constrained devices like Jetson Nano, demonstrating how EdgeVidSum addresses the critical challenges of computational efficiency, personalization, and privacy in modern video consumption environments.
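The step from per-thumbnail preference scores to a fast-forward summary can be sketched as simple thresholded segment extraction. The function below is hypothetical (the name, one-thumbnail-per-`1/fps_thumb`-seconds sampling, and fixed threshold are assumptions, not EdgeVidSum's implementation): it converts a score sequence into (start, end) timestamps to keep.

```python
def fast_forward_segments(scores, fps_thumb=1.0, threshold=0.5):
    """Turn per-thumbnail preference scores into summary segments.

    scores: one preference score per thumbnail, sampled at
    fps_thumb thumbnails per second of source video.
    Returns a list of (start_sec, end_sec) segments whose scores
    stay at or above the threshold.
    """
    segments, start = [], None
    for i, s in enumerate(scores):
        t = i / fps_thumb
        if s >= threshold and start is None:
            start = t            # open a new segment
        elif s < threshold and start is not None:
            segments.append((start, t))  # close the segment
            start = None
    if start is not None:        # segment runs to the end
        segments.append((start, len(scores) / fps_thumb))
    return segments
```

Only the retained segments would then be decoded and concatenated, which is what keeps the per-frame cost low on a device like the Jetson Nano.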
10.48340/ecscw2025_ep01
The increasing use of online study rooms raises critical questions about the dynamics and challenges of virtual learning environments. This paper explores students' experiences in remote study settings through semi-structured interviews with 13 university students. We explore the factors influencing their participation, the obstacles they face before, during, and after study sessions, and the role of social interaction in sustaining motivation. Our findings highlight the significance of a perceived sense of responsibility--enhanced by camera usage--in maintaining concentration and engagement. Moreover, feelings of intimacy and belonging, particularly when studying with close peers, play a significant role in motivation and focus. Students also report challenges such as coordinating schedules and managing distractions in group study sessions. We propose design implications for enhancing online study environments based on these insights. We emphasize fostering a stronger sense of community, minimizing distractions, and facilitating effective collaboration. Our contributions inform the design of more inclusive and engaging virtual study platforms with broader implications for learning communities and online collaboration tools.
10.18653/v1/2024.emnlp-main.994
As the explainability of mental disorder detection models has become important, symptom-based methods that predict disorders from identified symptoms have been widely utilized. However, because these approaches focus on the presence of symptoms, the context of those symptoms is often ignored, missing contextual information that is important for detecting mental disorders. Furthermore, disorder detection can be vulnerable to errors that occur when identifying symptoms. To address these issues, we propose a novel framework that detects mental disorders by leveraging symptoms and their context while mitigating potential errors in symptom identification. Specifically, we use large language models to effectively extract contextual information and introduce an uncertainty-aware decision fusion network that combines predictions of multiple models based on quantified uncertainty values. To evaluate the proposed method, we constructed a new Korean mental health dataset annotated by experts, named KoMOS. Experimental results demonstrate that the proposed model accurately detects mental disorders even in situations where symptom information is incomplete.
10.18653/v1/2024.emnlp-industry.49
This study introduces a Multidisciplinary chILDhood cancer survivor question-answering (MILD) bot designed to support childhood cancer survivors facing diverse challenges in their survivorship journey. In South Korea, a shortage of experts equipped to address these unique concerns comprehensively leaves survivors with limited access to reliable information. To bridge this gap, our MILD bot employs a dual-component model featuring an intent classifier and a semantic textual similarity model. The intent classifier first analyzes the user’s query to identify the underlying intent and match it with the most suitable expert who can provide advice. Then, the semantic textual similarity model identifies questions in a predefined dataset that closely align with the user’s query, ensuring the delivery of relevant responses. This proposed framework shows significant promise in offering timely, accurate, and high-quality information, effectively addressing a critical need for support among childhood cancer survivors.
The advance of deepfake technology has raised serious social concerns, in particular the spread of deepfake pornography. Existing content-diffusion prediction research has focused mainly on text-based fake news and is of limited use for predicting the spread of audiovisual deepfake content. To analyze and predict how deepfake content propagates, this study collected deepfake pornography and modeled propagation trees from it. For network analysis, the Wiener index was used to quantify the diffusion structure of deepfake content, and a CNN-based prediction model was used to identify content with high diffusion potential. Experiments showed that content in the top 20% by Wiener index could be predicted effectively, providing an important criterion for early response to the spread of deepfake content.
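The Wiener index used to quantify diffusion structure is a standard graph statistic: the sum of shortest-path distances over all unordered node pairs. A minimal pure-Python sketch (the adjacency-list representation of the propagation tree is an assumption about the data format):

```python
from collections import deque

def wiener_index(adj):
    """Wiener index of a connected graph given as an adjacency
    list {node: [neighbors]} — e.g. a share/repost propagation tree.

    Sums shortest-path distances over all unordered node pairs.
    """
    total = 0
    for src in adj:
        # BFS from src yields shortest paths in an unweighted graph
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # each pair was counted in both directions
```

Intuitively, chain-like propagation (long paths) yields a larger index than star-like propagation of the same size, so the index separates viral cascades that keep being re-shared from one-hop broadcasts.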
This study develops a firefighting training simulation that incorporates AI technology to minimize fire casualties, and presents new firefighting training guidelines that put human safety first.
In particular, by maximizing the effectiveness of firefighting training and strengthening practical field-response capability through AI-based simulation, it demonstrates that AI technology can be applied reliably and practically in the public-safety domain.
This study validates nonverbal factors that enhance feelings of trust in interactions with AI agents and explores how AI technology can be used effectively in mental-health and social-interaction contexts.
It also proposes a method for designing human-AI interaction cues and, through user-experience evaluation and analysis, derives practical and reliable interface guidelines for AI-based service design, addressing both the practicality and the ethical use of AI technology.
This study proposes effective feedback approaches that use AI agents to reduce learners' anxiety and cognitive load in language learning, exploring the potential of AI-based educational technology to support learners' psychological well-being.
In particular, it presents conversational-agent designs that foster a comfortable, supportive learning environment and empirically confirms how combinations of feedback style and feedback source affect satisfaction with English learning, suggesting ways to enhance learning outcomes and use AI technology reliably.