Turns out the driver WSLg uses for my dev container doesn't support GL_ARB_bindless_texture. Another reason why Linux is better suited for this :gura_pout:
I'm switching away from the devcontainer because of how underdeveloped WSLg is for the most part, from windowing issues to a shitty driver that can't fully utilize my GPU. :cirno_facepalm:
I'll go back to the devcontainer once I start daily-driving Linux :ina_lownod:
The proper OpenGL solution is to bind an array of all your textures once per frame and access them by index. Bindless textures can incur significant penalties on some hardware and driver combos. I ran into lots of performance voodoo trying to get it to work properly.
Underneath it all, the expensive thing is changing uniform values. Uniforms occupy a special section of cache in the GPU and swapping them out is costly. Vulkan exposes this directly with Descriptor Sets and lets you manually manage which parts are being changed and when. With OpenGL, you're at the mercy of the driver to maybe do something clever if you carefully abide by a bunch of unwritten rules and also the stars align. Hence "rebinding textures is expensive", though it should actually be "rebinding any uniforms will probably be expensive".
Very nice :D Which kind of files are you able to support?
btw, I talked some with inginsub about the performance numbers.
He played around with a similar example (C++ OpenGL). He got around 60 FPS on his first pass, 90 FPS with hand optimizations, and 50 FPS when using clang. Then he passed optimization flags to (one of) the compilers and got 3200 FPS.
So, it sounds like there is a lot of extra speed available to you in the future if you ever need it for something :D
@hazlin I’ve used Assimp and stb_image to load models and textures into the program. Both libraries support a lot of model/image file formats, but I decided to stick with GLB for models and PNG/JPEG for textures.
I actually had to cut the number of rendered objects from 2400 cubes to 100 monkey heads once I managed to load them into OpenGL. The complex meshes effectively crippled my program, because my batch rendering process slowed down badly when it had to batch them. At some point I'll need to revise my rendering process to use instancing in OpenGL. Having to add support for skeletal animations is what has really held me back.
The "procedural tree structure" in my code is literally for trees, like arboles, with leaves and acorns and such. It's just an example of a complex cpu-side algorithm that is running every frame.
The draw call itself to the gpu will be very fast in the grand scheme of things. It just adds some data to a list in the driver. It doesn't represent the rendering time, nor does rendering actually start during that call. You have to use gpu timers like http://www.lighthouse3d.com/tutorials/opengl-timer-query/ to see how long the drawing takes.
Basic rendering has roughly 3 steps: update a bunch of buffers, bind the shader, and issue draw commands. Binding the shader and issuing the draw commands are asynchronous; they merely add things into an internal command buffer in the driver (you do this explicitly in Vulkan). Updating buffers is the tricky part. You can't just write directly into GPU memory. The driver has to coordinate the writes so that there are no conflicts with ongoing asynchronous jobs, such as the previous frame which is still actually in progress on the GPU. If you try to update a buffer which is still being used by another render job (such as the previous frame), the driver will stall your cpu thread and wait until the gpu is finished with the buffer before doing the copy. This is called a pipeline stall and is only one of many ways of causing one. As such, any buffer which is updated every frame needs to cycle through three different gpu buffers so as to not stall rendering. Yes, three. You can realistically expect to have three frames in flight at various stages of progress at the same time. You can also partition one buffer into three sections and render from an offset, if you use the correct flags when creating the buffer and the correct functions for updating a region of it.
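A minimal CPU-side sketch of the three-section cycling described above. `RingSections` and `alignUp` are hypothetical names, not part of OpenGL. The sketch only works out which byte range to write in a given frame, and it assumes each section is padded to whatever offset alignment your buffer type requires (the 256 in the usage note is a placeholder, not a queried value). The actual upload and draw calls are left out.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to a multiple of `alignment` (must be a power of two).
inline std::size_t alignUp(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: one GPU buffer split into equal sections, with the
// CPU writing a different section each frame so it does not touch data
// that earlier frames may still be reading.
class RingSections {
public:
    explicit RingSections(std::size_t sectionBytes, std::size_t sectionCount = 3)
        : sectionBytes(sectionBytes), sectionCount(sectionCount) {
        assert(sectionBytes > 0 && sectionCount > 0);
    }

    // Size to allocate for the whole buffer.
    std::size_t totalBytes() const { return sectionBytes * sectionCount; }

    // Byte offset of the section to write, and render from, this frame.
    std::size_t currentOffset() const {
        return (frame % sectionCount) * sectionBytes;
    }

    // Call once per frame, after the draw commands have been issued.
    void advance() { ++frame; }

    const std::size_t sectionBytes;
    const std::size_t sectionCount;

private:
    std::size_t frame = 0;  // number of frames completed so far
};
```

For example, `RingSections ring(alignUp(perFrameBytes, 256));` gives offsets 0, 1x, 2x the section size, then wraps back to 0. Each frame you would write the new data at `ring.currentOffset()`, draw from that same offset, and then call `ring.advance()`. Cycling offsets only makes a stall unlikely; putting a fence after each section's draw calls (glFenceSync, then glClientWaitSync before reusing the section) is the usual way to be sure the GPU is done with it.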
@gentoobro @MoeBritannica uhh, I don't have the numbers in front of me, but I believe I clocked the "rebuild the entire procedural tree structure" step at 37ms and the call to the gpu at about 1ms.
... though I am very new to this, so maybe you mean something else by stalling the pipeline.
Though, MoeBritannica and I had the same frame rate of 24 FPS for 2400 cubes, building the render info and batching it in a single call. That was GDScript for me, and C++/OpenGL for him.
inginsub is of course going to write something more efficient right out the gate than either of us. But, he didn't see the extreme performance gain you'd expect with C++ until he gave the compiler better flags.
Compiler flags won't turn 90 fps into 3200 fps with any code that wasn't purposefully written to exploit weaknesses in compiler optimizations. 2400 cubes is only 28,800 triangles (57,600 vertices if no face corners are shared), which should get you at least 1000 fps on really old hardware, even if you were doing a bunch of matrix math on the cpu side for each of the 2400 cubes. I rebuild the entire procedural tree structure on each frame for ~25,000 trunk segments, including a bunch of quaternion math and trig functions, and still don't dip below locked 60fps on my Ryzen 1600/GF 950 potato.
But updating the same buffer(s) every frame and stalling the pipeline would easily bring the framerate to its knees.
You need to partition those buffers into 3 sections and cycle between them every frame. The glBufferSubData call will try to overwrite the data that's already in the buffer (obviously), but that data is still being used by the GPU for the previous frame, so the call will just block until the previous frame is done. You won't notice this when your render speed is much faster than your refresh rate, but it will utterly destroy performance when you start actually pushing a GPU even a little bit.
@hazlin Did a massive revision of my batch rendering process: the mesh data is now passed once, and each object's transformation is instanced. :cat_code:
It can now render 2500 dynamic objects (funny monkes) while maintaining 60 FPS :ina_nod: For static objects (funny monkes too), it can handle 22500 of them at 60 FPS :tamamo_dab:
@hazlin Apparently, with compiler optimizations I was able to squeeze out much more performance. It can now render 40000 dynamic monkey objects while maintaining 60 FPS. :senko_shock:
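A rough back-of-the-envelope sketch of why passing the mesh once and instancing the transforms can help so much. It assumes the old batching path rewrote every object's vertices into one big buffer each frame, which the thread does not confirm. The function names and the vertex count and vertex size in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes re-uploaded per frame if every object's vertices are transformed
// on the CPU and written into one big batch buffer (assumed old approach).
inline std::size_t batchedBytesPerFrame(std::size_t objects,
                                        std::size_t vertsPerMesh,
                                        std::size_t bytesPerVertex) {
    assert(bytesPerVertex > 0);
    return objects * vertsPerMesh * bytesPerVertex;
}

// Bytes re-uploaded per frame with instancing: the mesh is uploaded once
// up front, and only one 4x4 float matrix per object changes each frame.
inline std::size_t instancedBytesPerFrame(std::size_t objects) {
    return objects * 16 * sizeof(float);
}
```

With a hypothetical 500-vertex mesh at 32 bytes per vertex, `batchedBytesPerFrame(100, 500, 32)` is 1,600,000 bytes for just 100 objects, while `instancedBytesPerFrame(2500)` is 160,000 bytes for 2500 objects (with 4-byte floats). The per-instance matrices would then be consumed by an instanced draw such as glDrawElementsInstanced.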