RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation
Graphical Abstract
Abstract
Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Furthermore, our qualitative analysis demonstrates that the contextually refined scene graphs from RS-Net serve as an effective semantic prior in downstream applications such as video captioning, guiding Large Multimodal Models (LMMs) to focus on salient actions. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.
Overall Framework
Overview of the proposed RS-Net. The framework consists of four main components: (a) object detection and relation representation construction, (b) spatial context encoder, (c) temporal context encoder, and (d) relation scoring decoder. RS-Net is trained to distinguish semantically meaningful relations from irrelevant ones by incorporating both spatial and temporal cues. The resulting scores are used to guide predicate classification and triplet score computation during scene graph generation.
Key Contributions
- RS-Net provides explicit relation scoring and integrates into existing DSGG frameworks without architectural changes.
- Learnable spatial and temporal context tokens assess the contextual importance of each object pair.
- Relation importance is scored from both intra-frame spatial interactions and long-range, video-level temporal context.
- The contextually refined scene graphs from RS-Net serve as a relation prior for LMMs in downstream video understanding tasks.
Technical Details
3.1 Motivation and Overview
We propose RS-Net to address the distributional mismatch between limited training supervision and exhaustive inference in Dynamic Scene Graph Generation (DSGG). Unlike prior methods that implicitly learn relation importance through classification, RS-Net introduces an explicit relation scoring mechanism that ranks both positive and negative object pairs.
3.2 Relation Representation Construction
For each object pair, we construct relation representations by concatenating projected visual features, category embeddings, and union region features. This formulation captures both semantic and spatial interactions between objects.
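The construction above can be sketched as a small module. This is an illustrative implementation, not the authors' exact code: the feature dimensions, class count, and projection sizes are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class RelationRepresentation(nn.Module):
    """Builds a per-pair relation representation by concatenating projected
    subject/object visual features, union-region features, and category
    embeddings (dimensions are illustrative assumptions)."""
    def __init__(self, vis_dim=2048, num_classes=36, emb_dim=200, out_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)        # subject/object visual features
        self.union_proj = nn.Linear(vis_dim, out_dim)      # union-region visual feature
        self.cls_emb = nn.Embedding(num_classes, emb_dim)  # semantic category embedding

    def forward(self, subj_feat, obj_feat, union_feat, subj_cls, obj_cls):
        # Concatenate visual (appearance + spatial union) and semantic cues per pair.
        parts = [
            self.vis_proj(subj_feat),
            self.vis_proj(obj_feat),
            self.union_proj(union_feat),
            self.cls_emb(subj_cls),
            self.cls_emb(obj_cls),
        ]
        return torch.cat(parts, dim=-1)  # (num_pairs, 3*out_dim + 2*emb_dim)
```

The union-region feature carries the spatial interaction signal (relative layout of the two boxes), while the category embeddings supply the semantic prior.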
3.3 Spatial and Temporal Context Encoding
RS-Net employs a Transformer-based spatial context encoder to model intra-frame relational dependencies, followed by a temporal context encoder that captures long-term dependencies across the entire video. A learnable context token summarizes frame-level and video-level information, enabling global relational reasoning beyond short temporal windows.
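A minimal sketch of the context-token idea follows, assuming a standard Transformer encoder; the layer counts and dimensions are placeholders, and in RS-Net one such encoder operates over the pairs within a frame (spatial) and another over frame summaries across the whole video (temporal).

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Transformer encoder that prepends a learnable context token whose
    output summarizes the whole sequence (frame-level or video-level)."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        self.ctx_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):  # x: (batch, num_tokens, dim) relation tokens
        ctx = self.ctx_token.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([ctx, x], dim=1))
        # Position 0 is the context summary; the rest are refined relation tokens.
        return out[:, 0], out[:, 1:]
```

Because the temporal encoder attends over the entire video rather than a sliding window, the summary token can capture long-range dependencies beyond short temporal neighborhoods.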
3.4 Relation Scoring
The relation scoring decoder predicts a 2-dimensional probability distribution indicating whether each relation is contextually meaningful. This design allows stable training and provides an explicit probabilistic interpretation for ranking object relations during inference.
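A plausible form of the scoring head is sketched below; the hidden sizes and MLP depth are assumptions. The key point from the text is the 2-way output: a softmax over {not meaningful, meaningful} gives a calibrated probability for ranking.

```python
import torch
import torch.nn as nn

class RelationScoringDecoder(nn.Module):
    """Predicts a 2-class distribution per pair; the positive-class
    probability is used as the relation score at inference."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, rel_tokens):       # (num_pairs, dim)
        logits = self.head(rel_tokens)   # 2-way: irrelevant vs. meaningful
        probs = logits.softmax(dim=-1)
        return probs[..., 1]             # probability the relation is meaningful
```

Training this head with a standard 2-class cross-entropy over both annotated (positive) and non-annotated (negative) pairs supplies the missing supervision for non-related pairs.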
RS-Net Integration
RS-Net is integrated into existing dynamic scene graph generation frameworks through a modular design that requires minimal architectural modification. The relation scoring module operates alongside the original predicate classifier, enabling explicit ranking of object relations using video-level temporal context. This integration enhances relational reasoning while preserving compatibility with existing DSGG pipelines.
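One simple way to realize the unified triplet score is multiplicative fusion of the detector confidences, the predicate probability, and the relation score; this exact form is an assumption for illustration, not necessarily the paper's formula.

```python
def triplet_score(subj_conf: float, obj_conf: float,
                  pred_prob: float, rel_score: float) -> float:
    """Hypothetical fused triplet score: detector confidences for subject and
    object, the predicate classifier's probability, and RS-Net's relation
    score are combined multiplicatively, so an unimportant pair (low
    rel_score) is suppressed even if its predicate probability is high."""
    return subj_conf * obj_conf * pred_prob * rel_score
```

Because the relation score multiplies the existing predicate score rather than replacing it, the baseline classifier is left untouched, which is what makes the integration modular.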
Qualitative Results
Figure. Qualitative comparisons between STTran and STTran with our RS-Net.
Quantitative Results
Table 1. Performance comparison on the Action Genome dataset.
Table 2. Performance comparison with computational efficiency metrics in the SGDET task.
Conclusion
We presented RS-Net, a modular relation scoring network for dynamic scene graph generation. RS-Net addresses two core limitations of existing DSGG methods: the lack of supervision for non-annotated object pairs and the inability to model long-range temporal context. To tackle these issues, RS-Net introduces a context-aware scoring mechanism that evaluates the semantic importance of object pairs by leveraging both spatial and video-level temporal representations. It integrates relation scores into a unified triplet scoring formulation, enhancing the model's ability to identify meaningful relations while suppressing irrelevant ones. Experiments on the Action Genome dataset show consistent improvements in Recall, Precision, and mean Recall, while maintaining competitive efficiency. Thanks to its lightweight and modular design, RS-Net can be seamlessly integrated into existing frameworks without architectural modifications, making it a practical and generalizable solution for real-world video understanding. Furthermore, our qualitative analysis demonstrates that the structured knowledge generated by RS-Net can serve as a reliable basis for multimodal reasoning tasks. These results suggest that RS-Net can serve as a robust foundation for future DSGG research and diverse real-world video applications.
BibTeX
@article{jo2026rs,
  title={RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation},
  author={Jo, Hae-Won and Cho, Yeong-Jun},
  journal={Pattern Recognition},
  pages={113352},
  year={2026},
  publisher={Elsevier}
}