DISCLAIMER: This article was migrated from the legacy personal technical blog originally hosted here, and thus may contain formatting and content differences compared to the original post. Additionally, it likely contains technical inaccuracies, opinions that the author may no longer align with, and most certainly poor use of English. This article remains public for those who may find it useful despite its flaws.
Over the last few years the Khronos Group has done a great job of proving that OpenGL is still in the game and that it can become the ultimate graphics API of choice, if it is not that already. However, we must note that it is not quite true yet that OpenGL 4.1 is a superset of its competitor, DirectX 11. There are still some holes to be filled, and I think the ARB should not stop there, as current hardware architectures have much more potential than what is currently exposed by any graphics API; establishing the future of OpenGL should therefore start by going one step further than DX11. In this article I would like to present my vision of the items that should be included in the next revision of the specification and how I see the future of OpenGL.
Since the original OpenGL Longs Peak announcement, graphics developers were really excited to get their hands on a completely revised OpenGL 3 specification. Due to severe backward compatibility and portability issues, however, the original plan seemed to fail, and developers expressed great disappointment at the ARB's decision to choose a more evolutionary move away from the legacy API instead of the radical rewrite. Still, the Khronos Group has since proved that the decision was not necessarily bad for OpenGL, and in fact we now have a pretty powerful API, even though the coexistence of the legacy and the new design greatly increased the complexity of the specification.
What we have now is an API that can really compete with DirectX 11, but I strongly believe that this is not the end of the story yet, as we still have a lot of work ahead of us. I mean this both from the point of view of exposing more hardware capabilities and of streamlining the API language itself to increase the productivity of the developers who use it. My plan is to target both of these issues in this article, also trying to focus on hardware functionality that is not yet exposed even by other graphics APIs.
Exposing more hardware capabilities
In this chapter of the article I will talk about some familiar and some not so familiar hardware features and the corresponding OpenGL extensions that should be included in the next revision of the specification in order to be able to confidently say that OpenGL is a strict superset of the competing graphics APIs. The extensions listed here are not in any particular priority order; they are simply listed in a way that eases the discussion of their functionality.
This extension provides GLSL built-in functions allowing shaders to load from, store to, and perform atomic read-modify-write operations on a single level of a texture from any shader stage. The extension also indirectly enables the same operations for buffer objects through texture buffers. This allows developers to implement more sophisticated shader-based algorithms that require more complex data structures than just plain arrays.
An example use case can be the implementation of Order-Independent Transparency (OIT) using fragment linked lists as presented by AMD at GDC10. Of course, there are a lot of other techniques that could benefit from hardware accelerated random access images (called UAV textures/buffers in DX11 terminology) including algorithms related to global illumination, ray tracing, and my personal favorite: scene management.
As the introduction of write operations other than the traditional framebuffer writes makes fragment shader execution sensitive to whether the hardware performs early-Z, the extension also introduces a new fragment shader input layout qualifier called “early_fragment_tests” that forces OpenGL to perform the depth and stencil tests early. Otherwise, the specification language stating that the depth and stencil tests are performed after fragment shader execution remains in effect.
Finally, the extension enables some form of control over the order of image loads, stores, and atomics relative to other pipeline operations accessing the same memory region both using the OpenGL API and from within shaders.
The API itself provides a DSA-style binding mechanism that enables binding to so called “image units” that are separate from that of texture image units. In the same style, the specification language and GLSL refers to the introduced read-write textures with the term “image”.
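As a rough sketch of how this looks from GLSL, the fragment shader below counts the fragments landing on each pixel with an atomic image operation; the image unit binding, the format qualifier, and the counter image itself are illustrative assumptions based on the EXT extension's syntax:

```glsl
#version 400
#extension GL_EXT_shader_image_load_store : require

// Force early depth/stencil testing so that fragments killed by the
// depth test never reach the imageAtomicAdd below.
layout(early_fragment_tests) in;

// Read-write "image" bound to an image unit via the new DSA-style API.
layout(size1x32) uniform uimage2D counterImage;

out vec4 color;

void main()
{
    // Atomically count the fragments landing on this pixel.
    imageAtomicAdd(counterImage, ivec2(gl_FragCoord.xy), 1u);
    color = vec4(1.0);
}
```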
In my opinion this is one of the most important extensions that should be made core with OpenGL 4.2 and I’m pretty sure this will actually happen.
This extension relaxes the restrictions of OpenGL on rendering to a currently bound texture and provides a mechanism to avoid read-after-write problems. More precisely, the extension allows rendering to a currently bound texture in the following cases:
- If the reads and writes are from/to disjoint sets of texels (after accounting for texture filtering rules) so it should work unless the drawn areas overlap, or
- If there is only a single read and write of each texel, and the read is in the fragment shader invocation that writes the same texel (e.g. using texelFetch2D).
Some of these situations were already supported implicitly, like rendering to one texture level while fetching from another. But the extension goes further and provides an API function that puts an explicit barrier between draw calls to ensure proper rendering.
The extension can be used to accomplish a limited form of programmable blending and can eliminate the need for any image or buffer data copies, provided we can live with the restrictions mentioned above.
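A minimal sketch of such a pass, assuming a texture “tex” that is both bound for sampling and attached to the currently bound FBO, and that each draw reads only texels written by earlier draws:

```c
/* First pass renders into the FBO attachment backed by "tex" while the
   same texture is bound for fetching (normally undefined behavior). */
glDrawElements(GL_TRIANGLES, firstCount, GL_UNSIGNED_INT, 0);

/* Make the texel writes of the draw above visible to subsequent
   texture fetches from "tex". */
glTextureBarrierNV();

/* The second pass may now fetch what the first pass wrote, e.g. to
   implement a custom blending step in the fragment shader. */
glDrawElements(GL_TRIANGLES, secondCount, GL_UNSIGNED_INT, 0);
```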
One may ask why we need this extension if we have GL_EXT_shader_image_load_store, as this one provides just a subset of that functionality. The answer is simple: performance. While read-write textures can mimic the same functionality, they usually use different hardware paths that are slower than regular read-only texture accesses. So it would be a definite benefit to have this extension in core OpenGL as well.
This extension does not have a public specification yet; however, it can be found in the extension lists of the latest Catalyst driver releases, sometimes with an EXT and sometimes with an ARB prefix. The extension itself provides an API to access a number of hardware atomic counters that provide efficient counter operations at a GPU-global scale.
Atomic counters come in handy when one has to read or write individual elements of a buffer or texture. As an example, this extension is needed to efficiently implement the OIT algorithm mentioned earlier: when constructing the fragment linked list, we need unique offsets into the linked list buffer. Such a unique offset can, of course, be acquired using atomic read-modify-write operations, but those perform much slower than hardware atomic counters.
Besides the mentioned example, atomic counters are useful in many algorithms from many domains; one important use case is to perform feedback operations similar to those provided by transform feedback. Such feedback operations can be used to implement various scene management or culling mechanisms.
The extension provides access to these atomic counters from GLSL and also makes it possible to back them with buffer objects, so after OpenGL draw calls the values of the counters are preserved in these buffers for subsequent use.
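Since the extension has no public specification, the snippet below is only a guess at the shader syntax, modeled on the layout-qualifier style used elsewhere in GLSL:

```glsl
#version 400
// Assumed syntax: an atomic counter backed by binding point 0 of an
// atomic counter buffer object.
layout(binding = 0, offset = 0) uniform atomic_uint nextRecord;

void main()
{
    // Returns a unique, GPU-global index, e.g. the offset of the next
    // free record in a fragment linked list buffer.
    uint index = atomicCounterIncrement(nextRecord);
    // ... write the fragment's data at "index" ...
}
```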
Early depth test is a common optimization for hardware accelerated graphics that skips the evaluation of fragment shaders for fragments that end up being discarded because they don’t pass the depth test. The problem is that if the fragment shader modifies the depth value of the fragment, early depth testing is disabled. One can force the early depth test with the functionality introduced by the extension GL_EXT_shader_image_load_store, but that can lead to rendering artifacts, as the modified depth value output by the fragment shader is not taken into account.
This extension allows the application to pass enough information to the GL implementation to safely activate some early depth test optimizations while still preserving the ability to account for the final depth value in the depth test. In order to accomplish this, the extension introduces four new fragment shader input layout qualifiers called “depth_unchanged”, “depth_any”, “depth_greater” and “depth_less”. The most interesting ones are the last two, which provide the ability to perform early-Z and hierarchical-Z tests from one direction to discard groups of fragments while still allowing the fragment shader to safely modify the depth value.
This technique comes in very handy when rendering volumetric particles, decals or billboards. Without this extension, one has to sacrifice the possibility of early fragment rejection in order to be able to create the volumetric primitives mentioned.
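In GLSL terms, a volumetric billboard shader could declare its intent like this (the thickness texture and the amount added to the depth are illustrative assumptions):

```glsl
#version 400
#extension GL_AMD_conservative_depth : require

// Promise that the shader only ever pushes fragments farther away, so
// with a GL_LESS depth test the hardware may still reject fragments
// early based on the interpolated depth.
layout(depth_greater) out float gl_FragDepth;

uniform sampler2D thicknessMap;
in vec2 texCoord;
out vec4 color;

void main()
{
    color = vec4(1.0);
    // Fake volume: push the depth back by a stored thickness value.
    gl_FragDepth = gl_FragCoord.z + texture(thicknessMap, texCoord).r;
}
```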
As far as I know, this feature is also present in DirectX 11, so it should be a must for OpenGL 4.x as well. As the extension is an AMD one, I don’t know whether NVIDIA GPUs support anything like this in hardware, but even if they don’t, they can simply ignore the new layout qualifiers and perform the late depth test instead. Of course, this would result in lower performance, but as far as functionality is concerned it should be just fine.
OpenGL provides two means to perform geometry instancing, via the extensions GL_ARB_draw_instanced and GL_ARB_instanced_arrays. While this (yet non-existent) extension would extend both, it is more relevant in the case of the extension mentioned later, so I named it accordingly.
The extension should trivially add the possibility to specify a “first instance” parameter for the instanced draw commands. Whether this is accomplished by introducing new variants of the glDrawElements* and glDrawArrays* draw commands or by having a separate command for specifying the new parameter is up to the ARB. The extension should also interact with GL_ARB_draw_indirect, which already mentions the lack of this parameter in GL and has already reserved a field in the indirect draw command structure for specifying the “first instance” parameter.
This extension would be more of a bug fix than a completely new feature, as this functionality should have been exposed when instancing was first introduced to OpenGL.
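One possible shape for such a command (the name and exact signature are invented here for illustration; the ARB may choose differently):

```c
/* Hypothetical instanced draw command with a "first instance"
   parameter appended to the existing parameter list. */
void DrawElementsInstancedBaseInstance(GLenum mode, GLsizei count,
                                       GLenum type, const void *indices,
                                       GLsizei primcount,
                                       GLuint baseinstance);

/* Example: render instances 100..149, so instanced vertex attribute
   arrays start sourcing at element 100 instead of 0. */
DrawElementsInstancedBaseInstance(GL_TRIANGLES, indexCount,
                                  GL_UNSIGNED_INT, 0, 50, 100);
```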
This is one of the extensions I would be happiest to see in the next release of the OpenGL specification. It would be a functional addition to the GL_ARB_draw_indirect extension, which currently only allows the execution of a single instanced draw command that sources its parameters from a buffer object.
The new extension would add a new buffer binding point called e.g. GL_DRAW_INDIRECT_PRIMITIVE_COUNT that would specify the source of the “primcount” parameter to the following newly introduced draw commands:
```c
void MultiDrawArraysIndirect(enum mode, sizei stride,
                             const void *indirect, const void *primcount);
void MultiDrawElementsIndirect(enum mode, enum type, sizei stride,
                               const void *indirect, const void *primcount);
```
This would not just allow executing multiple indirect draw commands at once, without further CPU action, but would also source the “primcount” parameter from a buffer object. Thus, if the draw commands are generated using transform feedback, read-write buffers or OpenCL (e.g. based on some GPU-based scene management algorithm), the application does not have to use asynchronous queries or other means that may introduce sync points into the rendering just to be able to feed the “primcount” parameter.
Some people said that this is quite a futuristic feature to expect and that such functionality will most probably be available only on a newer generation of GPUs, maybe with OpenGL 5. I was not that pessimistic, so I decided to raise the question with the relevant ARB members at NVIDIA and AMD. While I did not receive any answer from NVIDIA, I did receive some good news from AMD, as they said that this functionality can be implemented on Shader Model 5.0 level hardware.
What this extension would give developers is a way to efficiently implement GPU-based scene management, where the GPU bakes together all the rendering commands for the current frame using atomic counters and buffer writes, and the CPU just has to issue a few, or maybe even a single, MultiDraw*Indirect command to render the whole scene. But of course, the feature can also increase draw command throughput in the case of CPU-based scene management.
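Assuming the indirect buffer keeps the record layout of GL_ARB_draw_indirect, a GPU-driven frame could then boil down to something like this (the GL_DRAW_INDIRECT_PRIMITIVE_COUNT binding point and the MultiDrawElementsIndirect entry point are the hypothetical ones proposed above):

```c
/* One record of the indirect buffer, as laid out by GL_ARB_draw_indirect;
   note the reserved field where a "first instance" parameter could live. */
typedef struct {
    GLuint count;
    GLuint primCount;
    GLuint firstIndex;
    GLint  baseVertex;
    GLuint reservedMustBeZero;
} DrawElementsIndirectCommand;

/* "cmdBuffer" was filled by the GPU with draw records and "countBuffer"
   with the number of valid records; the CPU issues a single call. */
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuffer);
glBindBuffer(GL_DRAW_INDIRECT_PRIMITIVE_COUNT, countBuffer);
MultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                          sizeof(DrawElementsIndirectCommand),
                          (const void *)0,   /* offset into cmdBuffer   */
                          (const void *)0);  /* offset into countBuffer */
```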
So my message to the Khronos Group is: please start working on such an extension, as this would not just make developers happy; it would also strengthen OpenGL’s position in the industry by putting something into the specification that even DirectX 11 cannot do.
OpenGL 4.0 introduced the extension GL_ARB_transform_feedback3, which further extended the transform feedback capabilities provided by earlier extensions to allow output to separate vertex streams. However, there is one caveat: separate vertex streams are only supported for point primitives.
This new AMD extension does nothing more than remove that restriction for separate output streams, allowing the same set of primitive types to be used with multiple transform feedback streams as with a single stream, as long as the primitive type is the same for all output streams.
Limiting the possible output primitive types for transform feedback into multiple streams should not be a problem unless you also want to rasterize some triangles at the same time you output them. Without relaxing this restriction, one can do this only by issuing two separate draw commands, which incurs a performance hit.
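With the restriction removed, a geometry shader could, for instance, rasterize triangles on stream 0 while recording them to a second transform feedback stream in the same pass. The sketch below shows the idea; note that core OpenGL 4.0 would reject the non-point output on stream 1:

```glsl
#version 400
layout(triangles) in;
layout(triangle_strip, max_vertices = 6) out;

// Stream 0 is rasterized; stream 1 only feeds transform feedback.
layout(stream = 0) out vec4 color;
layout(stream = 1) out vec4 capturedPosition;

void main()
{
    for (int i = 0; i < 3; ++i) {
        gl_Position = gl_in[i].gl_Position;
        color = vec4(1.0);
        EmitStreamVertex(0);
    }
    EndStreamPrimitive(0);
    for (int i = 0; i < 3; ++i) {
        capturedPosition = gl_in[i].gl_Position;
        EmitStreamVertex(1);
    }
    EndStreamPrimitive(1);
}
```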
I don’t know whether the restriction is present in the ARB extension because NVIDIA does not support this in hardware, but if that is not the case, then I think this extension should be included in the next release of the specification. Otherwise, please, NVIDIA, include this feature in your next GPU generation.
OpenGL 3.1 already introduced a method to perform GPU accelerated copies of buffer data. This NVIDIA extension provides similar functionality that can be used to execute efficient image data transfers between image objects (i.e. textures and renderbuffers).
While there are already methods to perform image data copies between textures, e.g. using the GL_EXT_framebuffer_blit extension promoted to core with OpenGL 3.0, these require expensive framebuffer object operations and also lack direct support for transferring 3D image data.
This extension simply introduces a single command that allows such image data copies for every type of texture (including cube maps, 3D textures and array textures) without the need to bind the image objects or otherwise configure the rendering.
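For example, copying a 256×256 region between the base levels of two 2D textures takes a single call (“srcTex” and “dstTex” are assumed to be compatible, completely defined texture objects):

```c
glCopyImageSubDataNV(srcTex, GL_TEXTURE_2D, 0,  /* source, level 0   */
                     0, 0, 0,                   /* source x, y, z    */
                     dstTex, GL_TEXTURE_2D, 0,  /* destination       */
                     0, 0, 0,
                     256, 256, 1);              /* width, height, depth */
```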
The extension GL_ARB_depth_clamp, promoted to core with OpenGL 3.2, introduced the ability to control the clamping of the depth value for both the near and far clip planes. This eliminates artifacts, like seeing the inside of an object, that happen when the object’s geometry is clipped by the near clip plane.
This new extension provides a means for the application to enable depth clamping separately for the near and the far clip plane. This increases the flexibility of depth clamping and can save some fill rate in certain situations.
I don’t think I have to talk too much about this extension, as it should be familiar to all of you. It simply enables the use of anisotropic filtering on a per-texture basis. I really wonder how this extension didn’t make its way into core, as it has been supported by hardware for more than a decade.
I know that the extension itself is supported by all relevant graphics driver vendors, but really, why can’t we just include it in the core specification?
This is another yet non-existent extension that would extend GL_ARB_texture_gather by adding GLSL built-in functions called textureGatherLod that would allow gathered fetches with explicit LOD. I’m not sure whether these functions are missing from the specification because of a lack of hardware support or just because the ARB thought they might not be of any use. Anyway, if the hardware supports it, then OpenGL should expose it to developers, as there are certain situations where one has to use explicit LOD and could benefit from the increased fetching performance enabled by gathered fetches.
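The built-in could look something like this (name and signature are pure speculation, modeled after the existing textureGather and textureLod functions):

```glsl
// Hypothetical declaration: gather the four neighboring texels of a
// 2D texture from an explicitly selected LOD.
vec4 textureGatherLod(sampler2D sampler, vec2 coord, float lod);

// Example: fetch a 2x2 footprint from level 2 of a shadow map.
vec4 taps = textureGatherLod(shadowMap, shadowCoord, 2.0);
```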
This extension was published around the time the OpenGL 4.1 specification came out and provides the ability for the fragment shader to output the stencil reference value that was otherwise configurable only through API calls. This adds a great level of flexibility to existing and future stencil buffer based algorithms, making it possible to directly write independent values to the stencil buffer on a per-fragment basis.
The predecessor of this extension is GL_AMD_shader_stencil_export, which may indicate that it is only supported in hardware on AMD GPUs. However, if this is not the case and NVIDIA could support it as well, then I think it is worth promoting this feature to core OpenGL too.
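In GLSL the extension boils down to a single new output variable (the materialID uniform below is an illustrative assumption):

```glsl
#version 400
#extension GL_ARB_shader_stencil_export : require

uniform int materialID;
out vec4 color;

void main()
{
    color = vec4(1.0);
    // Write a per-fragment stencil reference value, e.g. a material ID,
    // directly from the shader instead of via glStencilFunc.
    gl_FragStencilRefARB = materialID;
}
```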
Streamlining the API
After discussing the long list of functional features that it would be nice to include in the next release of OpenGL, let’s focus on the API improvement extensions and ideas that are necessary to improve the usability of the API itself. Actually, this part could go on far longer than I’ll discuss here, because as more and more features are added to OpenGL, developers struggle with the increased complexity of the API. I’ll try to focus on the most crucial issues.
This is the extension that all OpenGL developers have been waiting for for a long time now. Direct state access eliminates the OpenGL API’s stupid “bind-to-modify” nature.
For a very long time the only vendor supporting the extension was NVIDIA. Fortunately, since Catalyst 10.7 AMD also exposes the extension to developers. Still, I have one problem: this extension is very poorly designed.
The main problem with the extension is that its functions were designed so that a naive implementation could be done by simply using “bind-to-modify” under the hood. That’s what resulted in crazy API functions like MultiTexParameter* and friends. Also, enabling DSA for all of the deprecated functionality would result in an explosion of the API specification and, as a consequence, in bloated specification language. Finally, I would also like to object somewhat to the contributors’ lack of creativity regarding the awkward naming conventions present in the current DSA extension.
In my opinion, the Khronos Group has to address the issue by creating a new ARB version of the DSA extension that focuses strictly on core functionality, throws away DSA support for deprecated features (if somebody needs deprecated features, they can still use the EXT version) and provides a naming convention that fits much better into the current API language.
Anyway, I completely agree with the other developers out there and scream for DSA. I think the Khronos Group has to eliminate the problem of the “bind-to-modify” semantics as soon as possible; otherwise, even though the core specification exposes more and more hardware features, developers will not be attracted to OpenGL.
The ARB moved in the right direction when they introduced the GL_ARB_explicit_attrib_location extension, eliminating the need to use dummy API calls to bind vertex attributes and output buffers to shader variables, but they should not stop there. One of the most important additions would be a similar language syntax in GLSL that would allow us to bind sampler uniforms to texture image units. Obviously, the same goes for read-write images if GL_EXT_shader_image_load_store is included.
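One possible syntax, mirroring GL_ARB_explicit_attrib_location (purely hypothetical at the time of writing):

```glsl
// Bind sampler uniforms to texture image units directly in the shader,
// without any glUniform1i calls from the application.
layout(binding = 0) uniform sampler2D diffuseMap;
layout(binding = 1) uniform sampler2D normalMap;
```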
Similar to the previous request, uniform block indices should also be explicitly specifiable in the shaders themselves. This extension would add exactly such functionality. The implementation is also straightforward: just a simple uniform block layout qualifier has to be added.
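Again, a hypothetical sketch of what such a layout qualifier could look like:

```glsl
// Assign the uniform block to binding index 2 in the shader itself,
// making glUniformBlockBinding calls unnecessary.
layout(binding = 2) uniform Transforms {
    mat4 modelViewProjection;
    mat3 normalMatrix;
};
```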
Other API clarifications
Besides the major issues the current specification language also has some bugs and unclear parts that should be addressed as well:
- Program pipeline objects are created by binding the object name, which is not in line with the rest of the API language.
- There is no language about whether program pipeline objects are shared among contexts, which suggests that they aren’t; this is not in line with the fact that program and shader objects are shared.
Most probably there are a lot more issues with the specification language, but for now only these come to mind. Maybe some of you can extend the list with tons of other specification mistakes.
OpenGL 4.2 and beyond
While my feature requests cover most of the needed functionality that should be included in the next revision of the OpenGL specification, there are a lot of other things that could be very useful for developers but are very unlikely to make their way into the specification any time soon. I will talk about these features in this section of the article, as they raise far more questions than could be settled by simply including them in OpenGL 4.2.
We have had multi-GPU designs like SLI and CrossFire for a long time now. Fortunately, we also have vendor specific extensions to create affinity contexts that are associated with a single GPU of a multi-GPU configuration: WGL_AMD_gpu_association and WGL_NV_gpu_affinity for Windows, and GLX_AMD_gpu_association on GLX based platforms. I have just two problems with this:
- First, these are vendor specific extensions.
- Second, NVIDIA exposes its affinity context support only on Windows and only for their professional cards, leaving consumer hardware owners without affinity context support.
I would be pleased to see future extensions like WGL_ARB_gpu_affinity_context and GLX_ARB_gpu_affinity_context that would be supported by both NVIDIA and AMD, on both professional and consumer hardware.
I would like to see something in OpenGL similar to what we have in OpenCL. Having several separate command buffers for a single OpenGL context could have performance benefits, as some of the implicit sync points that are otherwise present in OpenGL draw commands could be eliminated. Another solution would be to simply use multiple GL contexts, but that is much more complicated, and context switches are quite heavy-weight operations. This would be something like how framebuffer objects replaced pbuffers.
This could even go as far as encapsulating state manipulation data into command buffers, in a similar way to how display lists allowed this in many cases, just in a more efficient and hardware centric manner.
Immutable state objects
Another thing strongly related to the previous idea would be immutable state objects. If state management data cannot be efficiently stored in such a command buffer, we could instead use immutable state objects that would be very similar in nature to display lists, hiding the underlying representation of the commands.
Display lists are deprecated, and I don’t think it was a wrong decision. They made the API language complex, and you never knew which commands compile into display lists and how. I remember the time I was making an OpenGL app on my GeForce2 and used DrawElements calls inside display lists that referenced buffer object data. Funnily, it worked on NVIDIA hardware, even though the specification says otherwise, and I was wondering why my app crashed on ATI cards.
Anyway, display lists are gone, but we need some complex state objects that could fill the holes they left behind.
I was very happy to see the appearance of an extension that introduced the callback concept into OpenGL (GL_AMD_debug_output). Since then, the functionality has been promoted to an ARB extension, meaning that the ARB has accepted the fact that we need callbacks.
What I would like to see in the future is more OpenGL callbacks. One of the most trivial candidates I can think of is asynchronous queries. It would be so much easier to receive a callback from OpenGL when the results of our asynchronous queries are available, rather than having to manually poll for the results at various phases of the rendering.
Actually, I could imagine callbacks for every rendering command issued that will be called by the driver as soon as the actual rendering is complete on the GPU side.
This is another thing that developers are screaming for. Fortunately, we now have indirect methods to solve most of the issues of programmable blending via the extensions GL_EXT_shader_image_load_store and GL_NV_texture_barrier; however, a more general solution would be welcome.
I don’t know whether this would be actually possible on current hardware but if not, then this is a message to hardware vendors to solve the issue in the near future.
We’ve seen that even though OpenGL is on track and the Khronos Group is keeping pace with its competitors, there is still a lot of room for improvement in the OpenGL specification, from a functional point of view as well as from an API design point of view.
I would like to end the article with a summary of what I expect to be part of the OpenGL 4.2 specification and my personal wish-list beyond that, in some kind of priority order.
My expectations for OpenGL 4.2:
My personal wish-list for OpenGL 4.2: