I have a video player application, and use multiple threads in order to keep the user interaction still smooth.
The thread that decodes the video originally just wrote the resulting frames as BGRA into a RAM buffer, which got uploaded to VRAM by glTexSubImage2D which worked fine enough for normal videos, but -as expected- got slow for HD (esp 1920x1080).
In order to improve that I implemented a different kind of pool class which has its own GL context (NSOpenGLContext as I am on Mac), which shares the resources with the main context.
Furthermore I changed the code so that it uses
glTextureRangeAPPLE( GL_TEXTURE_RECTANGLE_ARB, m_mappedMemSize, m_mappedMem );
and
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_SHARED_APPLE);
for the textures that I use, in order to improve the performance of uploading to VRAM.
Instead of uploading BGRA textures (which weigh in at about 8MB per frame for 1920x1080) I upload three individual textures for Y, U and V (each being GL_LUMINANCE, GL_UNSIGNED_BYTE, and the Y texture of original size, and the U and V at half the dimensions), thereby reducing the size being uploaded to about 3 MB, which already showed some improvement.
I created a pool of those YUV textures (depending on the size of the video it typically ranges between 3 and 8 surfaces (times three as it is Y, U and V components) - each of the textures mapped into its own area of the above m_mappedMem.
When I receive a newly decoded video frame I find a set of free YUV surfaces and update the three components each with this code:
glActiveTexture(m_textureUnits[texUnit]);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, planeInfo->m_texHandle);
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_SHARED_APPLE);
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
memcpy( planeInfo->m_buffer, srcData, planeInfo->m_planeSize );
glTexSubImage2D( GL_TEXTURE_RECTANGLE_ARB,
0,
0,
0,
planeInfo->m_width,
planeInfo->m_height,
GL_LUMINANCE,
GL_UNSIGNED_BYTE,
planeInfo->m_buffer );
(As a side question: I am not sure if for each of the textures I should use a different texture unit? [I am using unit 0 for Y, 1 for U and 2 for V btw])
Once this is done I put the textures that I used are marked as being used and a VideoFrame class is filled with their info (ie the texture number, and which area in the buffer they occupy etc) and put into a queue to be rendered. Once the minimum queue size is reached, the main application is notified that it can start to render the video.
The main rendering thread meanwhile (after ensuring correct state, etc) then accesses this queue (that queue class has its access internally protected by a mutex) and renders the top frame.
That main rendering thread has two framebuffers, and associated to them via glFramebufferTexture2D two textures, in order to implement some kind of double buffering.
In the main rendering loop it then checks which one is the front buffer, and then renders this front buffer to the screen using texture unit 0:
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, frontTexHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glPushClientAttrib( GL_CLIENT_VERTEX_ARRAY_BIT );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
Before doing that rendering to the screen of the current frame (as the usual framerate is about 24 fps for videos, this frame might be rendered a few times before the next videoframe gets rendered - that's why I use this approach) I call the video decoder class to check if a new frame is available (ie it is responsible for syncing to the timeline and updating the backbuffer with a new frame), if a frame is available, then I am rendering to the backbuffer texture from inside the videodecoder class (this happens on the same thread as the main rendering thread):
glBindFramebuffer(GL_FRAMEBUFFER, backbufferFBOHandle);
glPushAttrib(GL_VIEWPORT_BIT); // need to set viewport all the time?
glViewport(0,0,m_surfaceWidth,m_surfaceHeight);
glMatrixMode(GL_MODELVIEW);
glPushMatrix();
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glPushMatrix();
glLoadIdentity();
glMatrixMode(GL_TEXTURE);
glPushMatrix();
glLoadIdentity();
glScalef( (GLfloat)m_surfaceWidth, (GLfloat)m_surfaceHeight, 1.0f );
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_Y);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_U);
glActiveTexture(GL_TEXTURE2);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID_V);
glUseProgram(m_yuv2rgbShader->GetProgram());
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glEnableVertexAttribArray(m_attributePos);
glVertexAttribPointer(m_attributePos, 4, GL_FLOAT, GL_FALSE, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glEnableVertexAttribArray(m_attributeTexCoord);
glVertexAttribPointer(m_attributeTexCoord, 2, GL_FLOAT, GL_FALSE, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glUseProgram(0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, 0);
glPopMatrix();
glMatrixMode(GL_PROJECTION);
glPopMatrix();
glMatrixMode(GL_MODELVIEW);
glPopMatrix();
glPopAttrib();
glBindFramebuffer(GL_FRAMEBUFFER, 0);
[Please note that I omitted certain safety checks and comments for brevity]
After the above calls the video decoder sets a flag that the buffer can be swapped, and after the mainthread rendering loop from above, it checks for that flag and sets the frontBuffer/backBuffer accordingly. Also the used surfaces are marked as being free and available again.
In my original code when I used BGRA and uploads by glTexSubImage2D and glBegin and glEnd, I didn't experience any problems, but once I started improving things, using a shader to convert the YUV components to BGRA, and those DMA transfers, and glDrawArrays those issues started showing up.
Basically it seems partially like a tearing effect (btw I set the GL swap interval to 1 to sync with refreshes), and partially like it is jumping back a few frames in between.
I expected that having a pool of surfaces which I render to, and which get freed after rendering to the target surface, and double buffering that target surface should be enough, but obviously there needs to be more synchronization done in other places - however I don't really know how to solve that.
I assume that because glTexSubImage2D is now handled by DMA (and the function according to the documents supposed to return immediately) that the uploading might not be finished yet (and the next frame is rendering over it), or that I forgot (or don't know) about some other synchronization mechanism which I need for OpenGL (Mac).
According to OpenGL profiler before I started optimizing the code:
- almost 70% GLTime in glTexSubImage2D (ie uploading the 8MB BGRA to VRAM)
- almost 30% in CGLFlushDrawable
and after I changed the code to the above it now says:
- about 4% GLTime in glTexSubImage2D (so DMA seems to work well)
- 16% in GLCFlushDrawable
- almost 75% in glDrawArrays (which came as a big surprise to me)
Any comments on those results?
If you need any further info about how my code is set up, please let me know. Hints on how to solve this would be much appreciated.