Realtime 3D 360° Telepresence

Motivation

Studied telepresence scenario, where the user controls our MAVI robot in a remote environment.

TELEPRESENCE systems such as the one shown in the figure allow a user to immerse herself/himself into a remote environment. The telerobot (here our MAVI platform [1]) is equipped with sensors which capture visual and auditory information about the distant space. Video and audio signals are exchanged over a communication network. Depending on the distance between the user and the remote robot, such communication networks introduce inevitable delays, which not only lessen the immersive effect but are also detrimental to visual comfort. The time needed to mirror the user’s head motion and display the corresponding view on the screen is denoted as the Motion-to-Photon (M2P) latency. If the M2P latency exceeds a certain threshold [2], the user will suffer from motion sickness and is prone to terminate the telepresence session in progress [3], [4], [5]. It is crucial that the sensory impressions from the visual system, the vestibular system, and the non-vestibular proprioceptors are in accordance with the user’s perceived ego-motion as well as the user’s expectations based on prior experiences [6].

Proposed Approach

We decided to exploit the benefits of an actuated stereoscopic camera system, which provides the user with a 360°×180° stereoscopic visual impression and is at the same time lean, low-cost, mobile, and able to provide a large stereo budget. The vision system delivers a stereoscopic representation of the distant scene, i.e., separate imagery from different vantage points for each eye, to enable the sensation of depth. The actuation is done with a Pan-Tilt-Roll Unit (PTR-U), which has three Degrees-of-Freedom (DoFs) to mimic the user’s head motion at the client side. The ultra-low-delay remote control is shown in the following video.
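
As a rough illustration of how the head pose drives the PTR-U, the following sketch converts an HMD orientation quaternion into pan/tilt/roll set-points. The Euler convention, units, and function names are assumptions for illustration and do not reflect the actual MAVI control interface.

```python
# Minimal sketch (not the actual MAVI control code): map the HMD's head
# orientation to pan/tilt/roll set-points for a 3-DoF PTR-U.
# Assumes a yaw-pitch-roll Euler convention and angles in degrees;
# the real unit's command interface and limits will differ.
import math

def quaternion_to_ptr(qw, qx, qy, qz):
    """Convert a unit quaternion (w, x, y, z) into pan/tilt/roll angles [deg]."""
    pan = math.degrees(math.atan2(2.0 * (qw * qz + qx * qy),
                                  1.0 - 2.0 * (qy * qy + qz * qz)))
    tilt = math.degrees(math.asin(max(-1.0, min(1.0, 2.0 * (qw * qy - qz * qx)))))
    roll = math.degrees(math.atan2(2.0 * (qw * qx + qy * qz),
                                   1.0 - 2.0 * (qx * qx + qy * qy)))
    return pan, tilt, roll

# Example: head turned ~30 deg about the vertical axis
q = (math.cos(math.radians(15)), 0.0, 0.0, math.sin(math.radians(15)))
print(quaternion_to_ptr(*q))   # -> (~30.0, 0.0, 0.0)
```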

Simply using a stereoscopic PTR-U, however, is not sufficient. The total delay, which is an accumulation of various contributing latencies (see figure below), has a strong negative impact on the Quality of Experience (QoE). If the M2P latency is too high and the head is turned, the displayed image will first remain static or frozen until the motion is eventually reflected. The QoE therefore depends heavily on the underlying control or prediction method.

System diagram of the contributing latencies that accumulate into the perceived motion-to-photon delay.
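
To make the accumulation concrete, the following sketch sums up a hypothetical latency budget; all component values are placeholders for illustration, not measurements of our system.

```python
# Illustrative latency budget (placeholder values, not measurements): the
# perceived M2P delay accumulates along the chain from head tracker to display.
latency_ms = {
    "head_tracking": 2,
    "uplink_network": 25,
    "ptr_actuation": 40,
    "capture_and_encode": 30,
    "downlink_network": 25,
    "decode_and_render": 15,
}
m2p_ms = sum(latency_ms.values())
print(f"Accumulated motion-to-photon latency: {m2p_ms} ms")
```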

In view of these facts, I decided to exploit the benefits of the actuated stereoscopic camera system presented above and proposed several algorithms and techniques that provide the user with the impression of a 3D 360° video experience and compensate the delay introduced by the accumulated latency. In this way, we are able to provide an instantaneous visual response, which keeps the motion-to-photon latency at a minimum and hence ensures a pleasant user experience. The compensation approach consists, in general, of three essential components.

Buffer-based delay compensation

The so-called Delay Compensating Vision System (DCVS) is introduced to compensate the latency perceived by the user. The DCVS deploys fisheye cameras to capture a larger Field of View (FoV) than is actually displayed to the user on the Head-Mounted Display (HMD). A cache zone is thereby created around the displayed content. This buffer is leveraged for local delay compensation until the updated frame arrives. The compensation rate is introduced as a metric to describe the achievable level of compensation.
The DCVS obviously works only for a certain range of motion. For fast head rotations, the requested viewport leaves the captured, extended viewport.
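
The following sketch illustrates the buffer-based idea behind the DCVS with a simple pan-only crop; the field-of-view values, the linear angle-to-pixel mapping, and the function names are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of the buffer-based idea behind the DCVS: the cameras capture
# a wider field of view than is displayed, and the most recent locally known
# head pose selects the shown crop from the last received frame until a fresh
# frame arrives. Frames are assumed equirectangular, with the pan angle mapped
# linearly to the horizontal pixel axis (illustrative, not the real pipeline).
import numpy as np

CAPTURED_FOV_DEG = 180.0   # horizontal FoV captured by the fisheye cameras (assumed)
DISPLAYED_FOV_DEG = 90.0   # horizontal FoV actually shown on the HMD (assumed)

def crop_viewport(frame, frame_center_pan_deg, head_pan_deg):
    """Cut the displayed viewport out of the buffered wide-FoV frame."""
    h, w = frame.shape[:2]
    px_per_deg = w / CAPTURED_FOV_DEG
    # Offset of the requested view relative to the centre of the buffered frame,
    # clamped so the crop never leaves the captured area (compensation limit).
    offset_deg = head_pan_deg - frame_center_pan_deg
    max_offset = (CAPTURED_FOV_DEG - DISPLAYED_FOV_DEG) / 2.0
    clamped = max(-max_offset, min(max_offset, offset_deg))
    left = int(round((CAPTURED_FOV_DEG / 2.0 + clamped - DISPLAYED_FOV_DEG / 2.0) * px_per_deg))
    width = int(round(DISPLAYED_FOV_DEG * px_per_deg))
    compensated = abs(offset_deg) <= max_offset   # True while the cache covers the motion
    return frame[:, left:left + width], compensated

# Example: head already turned 30 deg further than the last frame's centre
frame = np.zeros((960, 1920, 3), dtype=np.uint8)
view, ok = crop_viewport(frame, frame_center_pan_deg=0.0, head_pan_deg=30.0)
print(view.shape, ok)   # (960, 960, 3) True
```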

Dynamic FoV adaptation

To remedy this limitation for quick head rotations, we proposed a velocity-based FoV adaptation technique. Depending on the current angular head-motion velocity, we temporarily decrease the user’s displayed FoV. In doing so, we momentarily enlarge the cache area, which results in a higher level of compensation for fast rotations. This approach is motivated by the characteristics of the human eye. We claim that a temporary reduction of the FoV during rapid head motions does not influence the feeling of presence. Instead, it has positive implications on the achievable level of compensation and supports the reduction of simulator sickness.
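
A minimal sketch of such a velocity-dependent FoV reduction is given below; the thresholds and FoV limits are assumed values for illustration, not the parameters used in our experiments.

```python
# Sketch of a velocity-dependent FoV reduction (thresholds and FoV values are
# assumptions for illustration). Above a velocity threshold, the displayed FoV
# shrinks linearly towards a minimum, which widens the surrounding cache zone
# available for delay compensation.
def adapted_fov_deg(angular_velocity_deg_s,
                    fov_max=90.0,      # FoV shown when the head is (nearly) still
                    fov_min=60.0,      # smallest FoV during very fast rotations
                    v_low=30.0,        # below this velocity: no reduction
                    v_high=180.0):     # at/above this velocity: full reduction
    if angular_velocity_deg_s <= v_low:
        return fov_max
    if angular_velocity_deg_s >= v_high:
        return fov_min
    alpha = (angular_velocity_deg_s - v_low) / (v_high - v_low)
    return fov_max - alpha * (fov_max - fov_min)

for v in (0, 60, 120, 240):
    print(v, adapted_fov_deg(v))   # 90.0, 84.0, 72.0, 60.0
```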

AI for viewport prediction

To further improve the achievable level of delay compensation, we investigated and developed several deterministic and probabilistic head-motion prediction approaches. Rather than sending the actual head orientation to the client, we send the prospective viewport position, depending on the present latency. We investigated various Deep Learning-based architectures. The best-performing deep network is based on stacked Gated Recurrent Units (GRUs) and convolutional components, which extract the most distinct features at different granularities. This network showed superior performance compared to the prior art.
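
The sketch below shows a stacked-GRU predictor with a convolutional feature extractor in PyTorch, in the spirit of the description above; the layer sizes, the quaternion input encoding, and the single-step prediction head are assumptions and not the architecture reported in the work.

```python
# Illustrative PyTorch sketch of a stacked-GRU head-motion predictor.
# All hyperparameters and the orientation encoding are assumptions.
import torch
import torch.nn as nn

class HeadMotionPredictor(nn.Module):
    def __init__(self, in_dim=4, conv_channels=32, hidden=128, layers=2):
        super().__init__()
        # 1-D convolution over the time axis extracts short-term motion features.
        self.conv = nn.Conv1d(in_dim, conv_channels, kernel_size=3, padding=1)
        # Stacked GRUs model longer-term temporal dependencies.
        self.gru = nn.GRU(conv_channels, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, in_dim)  # predicted future orientation

    def forward(self, x):
        # x: (batch, seq_len, 4) past head orientations as unit quaternions
        feats = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(feats)
        q = self.head(out[:, -1])                 # use the last time step
        return q / q.norm(dim=-1, keepdim=True)   # renormalise to a unit quaternion

# Example: predict the orientation one look-ahead step from 30 past samples
model = HeadMotionPredictor()
past = torch.randn(8, 30, 4)
print(model(past).shape)   # torch.Size([8, 4])
```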

[1] M. Karimi, T. Aykut, and E. Steinbach, “MAVI: A research platform for telepresence and teleoperation,” arXiv preprint arXiv:1805.09447, 2018.
[2] S. M. LaValle, A. Yershova, M. Katsev, and M. Antonov, “Head tracking for the oculus rift,” in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 187–194.
[3] J. T. Reason and J. J. Brand, Motion Sickness. Academic Press, 1975.
[4] J.-R. Wu and M. Ouhyoung, “On latency compensation and its effects on head-motion trajectories in virtual environments,” The Visual Computer, vol. 16, no. 2, pp. 79–90, 2000.
[5] R. S. Allison, L. R. Harris, M. Jenkin, U. Jasiobedzka, and J. E. Zacher, “Tolerance of temporal delay in virtual environments,” in Proceedings IEEE Virtual Reality 2001, 2001, pp. 247–254.
[6] M. A. Watson and F. Black, “The human balance system: A complex coordination of central and peripheral systems,” Portland, OR: Vestibular Disorders Association, 2008.