The interplay between sound and vision is a key determinant of human perception. With the development of Virtual Reality (VR) technologies and their commercial applications, there is an emerging need to better understand how audio-visual signals manipulated in virtual environments influence perception and human behaviour. The current study addresses this challenge in simulated VR environments mirroring real-life scenarios. In particular, we investigated the parameters that might enhance perception, and thus VR experiences, when sound and vision are manipulated. A VR museum was created mimicking a real art gallery featuring Japanese paintings. Participants were exposed to the gallery via a Samsung Gear VR head-mounted display and could walk freely within it. Half of the participants heard newly composed music clips during the VR gallery visit; the other half were exposed to the same environment but without music (control condition). The results showed that the music altered how people engaged with, perceived, and experienced the VR art gallery. Contrary to our expectation, the VR experience was liked more when no music was played. Naturalness and presence were rated relatively high and did not differ significantly depending on whether music was played. Regression modelling further explored the relationships among the parameters hypothesised to influence the VR experiences. The findings are summarised in a theoretical model. The study outcomes could be applied to develop efficient VR applications for art and entertainment.