Siyuan Qi

PhD @ UCLA CS

About Me

I graduated from the Computer Science Department at the University of California, Los Angeles in 2019. During my Ph.D. studies I did computer vision research at the Center for Vision, Cognition, Learning, and Autonomy, advised by Professor Song-Chun Zhu.

I am currently working at Google. Please refer to my Google Scholar page for a more up-to-date publication list.

My research interests include Computer Vision, Machine Learning, and Cognitive Science.

We who cut mere stones must always be envisioning cathedrals.

Quarry worker's creed

News

2018
2018 Sep

One paper accepted at NeurIPS 2018

Cooperative Holistic 3D Scene Understanding from a Single RGB Image.

[paper] [supplementary]

Conference  Machine Learning  Computer Vision 

2018 Jul

Two papers accepted at ECCV 2018

Learning Human-Object Interactions by Graph Parsing Neural Networks.

[paper] [code]

Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image.

[paper] [supplementary] [code]

Conference  Computer Vision 

2018 May

One paper accepted at ICML 2018

Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data for Future Prediction.

[paper] [supplementary] [code]

Conference  Machine Learning  Computer Vision 

2018 May

One IJCV paper accepted.

Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth Using Stochastic Grammars.

[paper] [demo] [code]

Journal  Computer Vision  Computer Graphics 

2018 Feb

One paper accepted at CVPR 2018

Human-centric Indoor Scene Synthesis Using Stochastic Grammar.

[paper] [supplementary] [code] [project]

Conference  Computer Vision  Computer Graphics 

2018 Jan

Two papers accepted at ICRA 2018

Intent-aware Multi-agent Reinforcement Learning.

[paper] [demo] [code]

Unsupervised Learning of Hierarchical Models for Hand-Object Interactions Using Tactile Glove.

[paper] [demo] [code] [project]

Conference  Robotics 

2017
2017 Jul

One paper accepted at ICCV 2017

Predicting Human Activities Using Stochastic Grammar.

[paper] [demo] [code]

Conference  Computer Vision 

2017 Jun

One paper accepted at IROS 2017

Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles.

[paper] [demo] [code] [project]

Conference  Robotics 

2017 Apr

Invited talk at VRLA Expo 2017

[Invited talk] I presented our work on "Examining Human Physical Judgments Across Virtual Gravity Fields" at VRLA Expo 2017.

Invited Talk  Virtual Reality  Cognitive Science 

2016
2016 Nov

One paper accepted at IEEE Virtual Reality 2017

[Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields.

Accepted to TVCG

[paper] [demo] [project]

Oral  Journal  Virtual Reality  Cognitive Science 

Projects


  • Human Activity Prediction

    This project aims to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional/hierarchical structure of events, integrating human actions, objects, and their affordances.

    Computer Vision  Robotics 
  • Indoor Scene Synthesis by Stochastic Grammar

    This project studies how to realistically synthesize indoor scene layouts using stochastic grammar. We present a novel human-centric method to sample 3D room layouts and synthesize photo-realistic images using physics-based rendering. We use object affordances and human activity planning to model indoor scenes, which contain functional grouping relations and supporting relations between furniture and objects. An attributed spatial And-Or graph (S-AOG) is proposed to model indoor scenes. The S-AOG is a stochastic context-sensitive grammar, in which the terminal nodes are object entities including the room, furniture, and supported objects. (A minimal grammar-sampling sketch follows this list.)

    Computer Vision  Computer Graphics 
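
To make the stochastic grammar used in both projects concrete, here is a minimal, self-contained And-Or graph sampler in Python. The grammar, branching probabilities, and object names below are invented for illustration; they are not the learned S-AOG from the papers.

import random

# Toy And-Or grammar: And-nodes expand all children, Or-nodes pick one child by
# probability, terminal nodes emit scene entities. All rules here are made up.
GRAMMAR = {
    "scene":           ("and", ["room", "furniture_group"]),
    "room":            ("terminal", "room"),
    "furniture_group": ("or", [("desk_area", 0.5), ("dining_area", 0.5)]),
    "desk_area":       ("and", ["desk", "chair", "monitor"]),
    "dining_area":     ("and", ["table", "chair", "chair"]),
    "desk":            ("terminal", "desk"),
    "chair":           ("terminal", "chair"),
    "monitor":         ("terminal", "monitor"),
    "table":           ("terminal", "table"),
}

def sample(node):
    """Recursively sample one parse from the And-Or grammar; return terminal labels."""
    kind, spec = GRAMMAR[node]
    if kind == "terminal":
        return [spec]
    if kind == "and":
        return [t for child in spec for t in sample(child)]
    children, probs = zip(*spec)                       # Or-node: sample one branch
    return sample(random.choices(children, weights=probs, k=1)[0])

print(sample("scene"))   # e.g. ['room', 'desk', 'chair', 'monitor']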

Publications


  • Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation

    Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, Song-Chun Zhu.

    NeurIPS 2018, Montreal, Canada

    Conference  Machine Learning  Computer Vision 

  • Abstract
    Holistic 3D indoor scene understanding refers to jointly recovering the i) object bounding boxes, ii) room layout, and iii) camera pose, all in 3D. The existing methods either are ineffective or only tackle the problem partially. In this paper, we propose an end-to-end model that simultaneously solves all three tasks in real time given only a single RGB image. The essence of the proposed method is to improve the prediction by i) parametrizing the targets (e.g., 3D boxes) instead of directly estimating the targets, and ii) cooperative training across different modules in contrast to training these modules individually. Specifically, we parametrize the 3D object bounding boxes by the predictions from several modules, i.e., 3D camera pose and object attributes. The proposed method provides two major advantages: i) The parametrization helps maintain the consistency between the 2D image and the 3D world, thus largely reducing the prediction variances in 3D coordinates. ii) Constraints can be imposed on the parametrization to train different modules simultaneously. We call these constraints "cooperative losses" as they enable the joint training and inference. We employ three cooperative losses for 3D bounding boxes, 2D projections, and physical constraints to estimate a geometrically consistent and physically plausible 3D scene. Experiments on the SUN RGB-D dataset show that the proposed method significantly outperforms prior approaches on 3D object detection, 3D layout estimation, 3D camera pose estimation, and holistic scene understanding. (A toy code sketch of the parametrization-plus-cooperative-loss idea follows the BibTeX entry below.)
@inproceedings{huang2018cooperative,
    title={Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation},
    author={Huang, Siyuan and Qi, Siyuan and Xiao, Yinxue and Zhu, Yixin and Wu, Ying Nian and Zhu, Song-Chun},
    booktitle={Conference on Neural Information Processing Systems (NeurIPS)},
    year={2018}
}
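
As a rough illustration of the parametrization-plus-cooperative-loss idea in the abstract above, the Python sketch below composes a 3D box center from a predicted 2D center, depth, and camera rotation, then combines a direct 3D term with a 2D re-projection consistency term. The shapes, field names, and loss weight are assumptions, not the paper's implementation.

import numpy as np

def compose_center_3d(center_2d, depth, K_inv, R_cam):
    """Lift a predicted 2D box center and depth into 3D using the predicted
    camera rotation R_cam and inverse intrinsics K_inv (toy parametrization)."""
    ray = K_inv @ np.append(center_2d, 1.0)     # back-project the pixel to a ray
    return R_cam @ (ray * depth)                # place the center along the ray

def cooperative_loss(pred, gt, K, K_inv):
    """Direct 3D supervision plus a 2D re-projection consistency term, so the
    box, pose, and layout predictions are trained jointly rather than separately."""
    center_3d = compose_center_3d(pred["center_2d"], pred["depth"], K_inv, pred["R_cam"])
    loss_3d = np.sum((center_3d - gt["center_3d"]) ** 2)     # 3D agreement
    proj = K @ (pred["R_cam"].T @ gt["center_3d"])           # project GT center to image
    proj = proj[:2] / proj[2]
    loss_2d = np.sum((proj - pred["center_2d"]) ** 2)        # 2D consistency
    return loss_3d + 0.1 * loss_2d                           # 0.1 is an arbitrary weight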
  • Learning Human-Object Interactions by Graph Parsing Neural Networks

    Siyuan Qi*, Wenguan Wang*, Baoxiong Jia, Jianbing Shen, Song-Chun Zhu.

    ECCV 2018, Munich, Germany

    Conference  Computer Vision 

  • Abstract
    This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection benchmarks on images and videos: HICO-DET, V-COCO, and CAD-120. Our approach significantly outperforms state-of-the-art methods, verifying that GPNN is scalable to large datasets and applies to spatial-temporal settings. (A toy message-passing sketch follows the BibTeX entry below.)
@inproceedings{qi2018learning,
    title={Learning Human-Object Interactions by Graph Parsing Neural Networks},
    author={Qi, Siyuan and Wang, Wenguan and Jia, Baoxiong and Shen, Jianbing and Zhu, Song-Chun},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2018}
}
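
The toy message-passing loop below illustrates the alternating computation described in the abstract: estimate a soft adjacency matrix from node states, pass messages, and update the states. The specific update rules and dimensions are assumptions, not GPNN's learned link, message, and update functions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def message_passing(node_feats, num_iters=3):
    """node_feats: (N, D) features for the human and object nodes of one scene."""
    h = node_feats.astype(float).copy()
    for _ in range(num_iters):
        adj = sigmoid(h @ h.T)         # link step: soft adjacency from pairwise similarity
        np.fill_diagonal(adj, 0.0)
        messages = adj @ h             # message step: weighted sum of neighbor states
        h = 0.5 * h + 0.5 * messages   # update step: blend messages into node states
    return adj, h                      # inferred graph structure and node states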
  • Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image

    Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, Song-Chun Zhu.

    ECCV 2018, Munich, Germany

    Conference  Computer Vision 

  • Abstract
    We propose a computational framework to jointly parse a single RGB image and reconstruct a holistic 3D configuration composed of a set of CAD models using a stochastic grammar model. Specifically, we introduce a Holistic Scene Grammar (HSG) to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes. The proposed HSG captures three essential and often latent dimensions of indoor scenes: i) latent human context, describing the affordance and the functionality of a room arrangement, ii) geometric constraints over the scene configurations, and iii) physical constraints that guarantee physically plausible parsing and reconstruction. We solve this joint parsing and reconstruction problem in an analysis-by-synthesis fashion, seeking to minimize the differences between the input image and the rendered images generated by our 3D representation, over the space of depth, surface normal, and object segmentation maps. The optimal configuration, represented by a parse graph, is inferred using Markov chain Monte Carlo (MCMC), which efficiently traverses the non-differentiable solution space, jointly optimizing object localization, 3D layout, and hidden human context. Experimental results demonstrate that the proposed algorithm improves the generalization ability and significantly outperforms prior methods on 3D layout estimation, 3D object detection, and holistic scene understanding. (A minimal analysis-by-synthesis MCMC sketch follows the BibTeX entry below.)
@inproceedings{huang2018holistic,
    title={Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image},
    author={Huang, Siyuan and Qi, Siyuan and Zhu, Yixin and Xiao, Yinxue and Xu, Yuanlu and Zhu, Song-Chun},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2018}
}
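
A bare-bones Metropolis-Hastings loop for the analysis-by-synthesis inference described above could look like the sketch below. Here render() and perturb_random_object() are hypothetical placeholders for a renderer and a proposal move, and the maps are assumed to be NumPy arrays; none of this is the paper's actual implementation.

import copy, math, random

def energy(scene, observed):
    """Discrepancy between rendered and observed depth / surface-normal /
    segmentation maps (assumed NumPy arrays); render() is a hypothetical stub."""
    rendered = render(scene)
    return sum(abs(rendered[k] - observed[k]).mean() for k in ("depth", "normal", "seg"))

def mcmc_parse(scene, observed, steps=1000, temperature=1.0):
    """Metropolis-Hastings over scene configurations (parse graphs)."""
    e = energy(scene, observed)
    for _ in range(steps):
        proposal = copy.deepcopy(scene)
        proposal.perturb_random_object()   # hypothetical move: translate/rotate/swap an object
        e_new = energy(proposal, observed)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new < e or random.random() < math.exp((e - e_new) / temperature):
            scene, e = proposal, e_new
    return scene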
  • Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data for Future Prediction

    Siyuan Qi, Baoxiong Jia, Song-Chun Zhu.

    ICML 2018, Stockholm, Sweden

    Conference  Machine Learning  Computer Vision 

  • Abstract
    Future prediction on sequence data (e.g., video or audio) requires algorithms to capture non-Markovian and compositional properties of high-level semantics. Context-free grammars are natural choices to capture such properties, but traditional grammar parsers (e.g., the Earley parser) only take symbolic sentences as input. In this paper, we generalize the Earley parser to parse sequence data that is neither segmented nor labeled. This generalized Earley parser integrates a grammar parser with a classifier to find the optimal segmentation and labels, and makes top-down future predictions accordingly. Experiments show that our method significantly outperforms other approaches for future human activity prediction. (A simplified scoring sketch follows the BibTeX entry below.)
@inproceedings{qi2018generalized,
    title={Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data for Future Prediction},
    author={Qi, Siyuan and Jia, Baoxiong and Zhu, Song-Chun},
    booktitle={International Conference on Machine Learning (ICML)},
    year={2018}
}
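
To give a flavor of how frame-wise classifier outputs and symbolic label sequences meet, the sketch below scores one candidate label sequence against per-frame class probabilities with a simple dynamic program. The actual generalized Earley parser additionally searches over grammar-consistent prefixes in a top-down manner, which is omitted here; this is an illustrative simplification, not the paper's algorithm.

import numpy as np

def sequence_probability(frame_probs, labels):
    """frame_probs: (T, C) per-frame class probabilities from a classifier.
    labels: candidate label sequence of consecutive segments, e.g. [2, 0, 3].
    Returns P(the unsegmented frames realize exactly this label sequence)."""
    T, _ = frame_probs.shape
    K = len(labels)
    # dp[k, t] = prob. that frames 0..t end inside the k-th segment (label labels[k])
    dp = np.zeros((K, T))
    dp[0, 0] = frame_probs[0, labels[0]]
    for t in range(1, T):
        for k in range(K):
            stay = dp[k, t - 1]                         # frame t continues segment k
            advance = dp[k - 1, t - 1] if k > 0 else 0  # frame t starts segment k
            dp[k, t] = frame_probs[t, labels[k]] * (stay + advance)
    return dp[K - 1, T - 1]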
  • Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth Using Stochastic Grammars

    Chenfanfu Jiang*, Siyuan Qi*, Yixin Zhu*, Siyuan Huang*, Jenny Lin, Lap-Fai Yu, Demetri Terzopoulos, Song-Chun Zhu.

    IJCV 2018

    Journal  Computer Vision  Computer Graphics 

  • Abstract
    We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and numerous photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable in that it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environment information (e.g., illumination and camera viewpoints). We demonstrate the value of our dataset by improving performance in certain machine-learning-based scene understanding tasks (e.g., depth and surface normal prediction, semantic segmentation, and reconstruction) and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.
@article{jiang2018configurable,
    title={Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth Using Stochastic Grammars},
    author={Jiang, Chenfanfu and Qi, Siyuan and Zhu, Yixin and Huang, Siyuan and Lin, Jenny and Yu, Lap-Fai and Terzopoulos, Demetri and Zhu, Song-Chun},
    journal = {International Journal of Computer Vision (IJCV)},
    year={2018}
}
  • Human-centric Indoor Scene Synthesis Using Stochastic Grammar

    Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, Song-Chun Zhu.

    CVPR 2018, Salt Lake City, USA

    Conference  Computer Vision  Computer Graphics 

  • Abstract
    We present a human-centric method to sample and synthesize 3D room layouts and 2D images thereof, for the purpose of obtaining large-scale 2D/3D image data with perfect per-pixel ground truth. An attributed spatial And-Or graph (S-AOG) is proposed to represent indoor scenes. The S-AOG is a probabilistic grammar model, in which the terminal nodes are object entities including room, furniture, and supported objects. Human contexts as contextual relations are encoded by Markov Random Fields (MRF) on the terminal nodes. We learn the distributions from an indoor scene dataset and sample new layouts using Markov chain Monte Carlo. Experiments demonstrate that the proposed method can robustly sample a large variety of realistic room layouts based on three criteria: (i) visual realism compared to a state-of-the-art room arrangement method, (ii) accuracy of the affordance maps with respect to ground truth, and (iii) the functionality and naturalness of synthesized rooms evaluated by human subjects.
@inproceedings{qi2018human,
    title={Human-centric Indoor Scene Synthesis Using Stochastic Grammar},
    author={Qi, Siyuan and Zhu, Yixin and Huang, Siyuan and Jiang, Chenfanfu and Zhu, Song-Chun},
    booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2018}
}
  • Intent-aware Multi-agent Reinforcement Learning

    Siyuan Qi, Song-Chun Zhu.

    ICRA 2018, Brisbane, Australia

    Conference  Reinforcement Learning  Robotics 

  • Abstract
    This paper proposes an intent-aware multi-agent planning framework as well as a learning algorithm. Under this framework, an agent plans in the goal space to maximize the expected utility. The planning process takes the belief of other agents' intents into consideration. Instead of formulating the learning problem as a partially observable Markov decision process (POMDP), we propose a simple but effective linear function approximation of the utility function. It is based on the observation that for humans, other people's intents influence our utility for a goal. The proposed framework has several major advantages: i) it is computationally feasible and guaranteed to converge; ii) it can easily integrate existing intent prediction and low-level planning algorithms; iii) it does not suffer from sparse feedback in the action space. We evaluate our algorithm on a real-world problem that is non-episodic and in which the number of agents and goals can vary over time. Our algorithm is trained in a scene in which aerial robots and humans interact, and tested in a novel scene with a different environment. Experimental results show that our algorithm achieves the best performance and human-like behaviors emerge during the dynamic process. (A toy utility-function sketch follows the BibTeX entry below.)
@inproceedings{qi2018intent,
    title={Intent-aware Multi-agent Reinforcement Learning},
    author={Qi, Siyuan and Zhu, Song-Chun},
    booktitle={International Conference on Robotics and Automation (ICRA)},
    year={2018}
}
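
A toy version of the linear utility approximation over goals might look as follows; the hand-crafted features (distance to a goal, how strongly other agents are believed to want the same goal) and the weight vector are made-up stand-ins for illustration.

import numpy as np

def goal_features(goal, state, intent_beliefs):
    """state: dict with 'agent_pos' (2,) and 'goal_pos' (G, 2) NumPy arrays.
    intent_beliefs: (num_other_agents, G) belief that each other agent pursues each goal."""
    dist = np.linalg.norm(state["agent_pos"] - state["goal_pos"][goal])
    crowding = intent_beliefs[:, goal].sum()       # others believed to want this goal
    return np.array([1.0, -dist, -crowding])

def select_goal(weights, state, intent_beliefs, num_goals):
    """Plan in goal space: pick the goal with the highest (linear) estimated utility."""
    utilities = [weights @ goal_features(g, state, intent_beliefs) for g in range(num_goals)]
    return int(np.argmax(utilities))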
  • Unsupervised Learning of Hierarchical Models for Hand-Object Interactions

    Xu Xie, Hangxin Liu, Mark Edmonds, Feng Gao, Siyuan Qi, Yixin Zhu, Brandon Rothrock, Song-Chun Zhu.

    ICRA 2018, Brisbane, Australia

    Conference  Robotics 

  • Abstract
    Contact forces of the hand are visually unobservable, but play a crucial role in understanding hand-object interactions. In this paper, we propose an unsupervised learning approach for manipulation event segmentation and manipulation event parsing. The proposed framework incorporates hand pose kinematics and contact forces using a low-cost, easy-to-replicate tactile glove. We use a temporal grammar model to capture the hierarchical structure of events, integrating extracted force vectors from the raw sensory input of poses and forces. The temporal grammar is represented as a temporal And-Or graph (T-AOG), which can be induced in an unsupervised manner. We obtain the event labeling sequences by measuring the similarity between segments using the Dynamic Time Alignment Kernel (DTAK). Experimental results show that our method achieves high accuracy in manipulation event segmentation, recognition, and parsing by utilizing both pose and force data. (A simple segment-similarity sketch follows the BibTeX entry below.)
@inproceedings{xu2018unsupervised,
    title={Unsupervised Learning of Hierarchical Models for Hand-Object Interactions},
    author={Xie, Xu and Liu, Hangxin and Edmonds, Mark and Gao, Feng and Qi, Siyuan and Zhu, Yixin and Rothrock, Brandon and Zhu, Song-Chun},
    booktitle={International Conference on Robotics and Automation (ICRA)},
    year={2018}
}
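
The paper measures similarity between candidate segments with the Dynamic Time Alignment Kernel (DTAK); as a simpler, self-contained stand-in, the sketch below computes a plain dynamic-time-warping cost between two multivariate pose-and-force segments.

import numpy as np

def dtw_distance(a, b):
    """a: (Ta, D), b: (Tb, D) feature sequences (e.g. concatenated pose and force).
    Returns the dynamic-time-warping alignment cost."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[Ta, Tb]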
  • Predicting Human Activities Using Stochastic Grammar

    Siyuan Qi, Siyuan Huang, Ping Wei, Song-Chun Zhu.

    ICCV 2017, Venice, Italy

    Conference  Computer Vision 

  • Abstract
    This paper presents a novel method to predict future human activities from partially observed RGB-D videos. Human activity prediction is generally difficult due to its non-Markovian property and the rich context between humans and environments. We use a stochastic grammar model to capture the compositional structure of events, integrating human actions, objects, and their affordances. We represent the event by a spatial-temporal And-Or graph (ST-AOG). The ST-AOG is composed of a temporal stochastic grammar defined on sub-activities, and spatial graphs representing sub-activities that consist of human actions, objects, and their affordances. Future sub-activities are predicted using the temporal grammar and the Earley parsing algorithm. The corresponding action, object, and affordance labels are then inferred accordingly. Extensive experiments are conducted to show the effectiveness of our model on both semantic event parsing and future activity prediction.
@inproceedings{qi2017predicting,
    title={Predicting Human Activities Using Stochastic Grammar},
    author={Qi, Siyuan and Huang, Siyuan and Wei, Ping and Zhu, Song-Chun},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2017}
}

  • Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles

    Mark Edmonds*, Feng Gao*, Xu Xie, Hangxin Liu, Siyuan Qi, Yixin Zhu, Brandon Rothrock, Song-Chun Zhu.

    IROS 2017, Vancouver, Canada

    Oral  Conference  Robotics 

  • Abstract
    Learning complex robot manipulation policies for real-world objects is challenging, often requiring significant tuning within controlled environments. In this paper, we learn a manipulation model to execute tasks with multiple stages and variable structure, which typically are not suitable for most robot manipulation approaches. The model is learned from human demonstration using a tactile glove that measures both hand pose and contact forces. The tactile glove enables observation of visually latent changes in the scene, specifically the forces applied to unlock the child-safety mechanisms of medicine bottles. From these observations, we learn an action planner through both a top-down stochastic grammar model (And-Or graph) representing the compositional nature of the task sequence and a bottom-up discriminative model from the observed poses and forces. These two terms are combined during planning to select the next optimal action. We present a method for transferring this human-specific knowledge onto a robot platform and demonstrate that the robot can perform successful manipulations of unseen objects with similar task structure.
@inproceedings{edmonds2017feeling,
    title={Feeling the Force: Integrating Force and Pose for Fluent Discovery through Imitation Learning to Open Medicine Bottles },
    author={Edmonds, Mark and Gao, Feng and Xie, Xu and Liu, Hangxin and Qi, Siyuan and Zhu, Yixin and Rothrock, Brandon and Zhu, Song-Chun},
    booktitle={International Conference on Intelligent Robots and Systems (IROS)},
    year={2017}
}
  • [Oral] The Martian: Examining Human Physical Judgments Across Virtual Gravity Fields.

    Tian Ye, Siyuan Qi, James Kubricht, Yixin Zhu, Hongjing Lu, Song-Chun Zhu.

    IEEE VR 2017, Los Angeles, California, USA
    Accepted to TVCG

    Oral  Journal  Virtual Reality  Cognitive Science 

  • Abstract
    This paper examines how humans adapt to novel physical situations with unknown gravitational acceleration in immersive virtual environments. We designed four virtual reality experiments with different tasks for participants to complete: strike a ball to hit a target, trigger a ball to hit a target, predict the landing location of a projectile, and estimate the flight duration of a projectile. The first two experiments compared human behavior in the virtual environment with real-world performance reported in the literature. The last two experiments tested the human ability to adapt to novel gravity fields by measuring performance in trajectory prediction and time estimation tasks. The experimental results show that: 1) based on brief observation of a projectile's initial trajectory, humans are accurate at predicting the landing location even under novel gravity fields, and 2) humans' time estimation in a familiar Earth environment fluctuates around the ground-truth flight duration, whereas time estimation in unknown gravity fields is biased toward Earth's gravity. (A toy landing-prediction calculation follows the BibTeX entry below.)
@article{ye2017martian,
    title={The Martian: Examining Human Physical Judgments across Virtual Gravity Fields},
    author={Ye, Tian and Qi, Siyuan and Kubricht, James and Zhu, Yixin and Lu, Hongjing and Zhu, Song-Chun},
    journal={IEEE Transactions on Visualization and Computer Graphics},
    volume={23},
    number={4},
    pages={1399--1408},
    year={2017},
    publisher={IEEE}
}
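
For intuition about the landing-prediction task, the toy function below computes where and when an ideal projectile lands for a given gravitational acceleration (flat ground at height zero, no drag). It is a textbook kinematics calculation, not the experimental setup used in the paper.

import math

def landing(p0, v0, g):
    """p0 = (x0, y0) initial position, v0 = (vx, vy) initial velocity,
    g > 0 downward acceleration. Returns (landing x position, flight time)."""
    x0, y0 = p0
    vx, vy = v0
    # Solve y0 + vy*t - 0.5*g*t^2 = 0 for the positive root.
    t = (vy + math.sqrt(vy * vy + 2.0 * g * y0)) / g
    return x0 + vx * t, t

print(landing((0.0, 1.5), (3.0, 2.0), 9.81))   # Earth gravity
print(landing((0.0, 1.5), (3.0, 2.0), 3.71))   # Mars-like gravity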

More


  • ICML Travel Award, The International Machine Learning Society
    2018
  • ICRA Travel Grant, IEEE Robotics and Automation Society
    2018
  • First Class Honors, Faculty of Engineering, University of Hong Kong
    2013
  • Undergraduate Research Fellowship, University of Hong Kong
    2012
  • Kingboard Scholarship, University of Hong Kong
    2010 & 2011 & 2012
  • Dean's Honors List, University of Hong Kong
    2010 & 2011
  • AI Challenge (sponsored by Google), 2nd place among Chinese contestants, 74th worldwide
    2011
  • Student Ambassador, University of Hong Kong
    2010
  • University Entrance Scholarship, University of Hong Kong
    2010