
Multithreading with games in C++

Started by
6 comments, last by SkitzFist 4 years, 3 months ago

Hello,

I'm currently in my first year of a game programming education at a university in Sweden.

For the last project we made, I discovered how awesome it is to use threading for loading textures, and my brain just went nuts with the possibilities. I was thinking of threading the rendering and much more.

Well, after hammering code for the last few days I've realised I've hit a wall.

I've tried to find YouTube videos on threading for game programming, but most videos I find show the same basic usage of threading: just calling a function a single time.

Now I'm just wondering if any of you could point me in a direction where I can learn more about how threading works for game programming.



For games it's common to create a constant number of worker threads (usually one per core) which keep running all the time and pick up work as it becomes available. (Creating a thread is too expensive to do for each task.)

Searching for ‘job system’ and ‘thread pool’ should bring you to game related tutorials or implementations.

@JoeJ Can you explain the grand scheme, please? Say I have 8 cores with hyper-threading, so 16 hardware threads. In some CPUs cores come in pairs and share a cache, so for a producer-consumer pattern (pipeline) I would want to make sure that I use adjacent threads.

Then there is the OS, which distributes threads to the processes, and high-level runtimes, which distribute these threads to threads the programmer requests. How can you check for errors?

Here on GameDev.net I read that id Software switched Doom from the worker thread pool to a library by Intel. Now C++ is similar to LINQ (.AsParallel) in C#. Probably there is even some async/await with Task objects. How are tasks less expensive than threads?

Async/await is coming in the upcoming C++ standard as far as I know, so stay tuned, but even then it is not very suitable in my opinion. The magic of async in C# is like the iterator pattern with the yield keyword: a state machine in a giant switch-case function that the compiler generates for you.

Creating threads and parking them is expensive on Windows; it is somewhat less expensive on Linux, but I would always prefer thread pools. They are more scalable and distribute better across your cores. I always acquire twice as many threads as I have cores on my system, minus one (for the main thread). I do this because I then block threads from the pool to run my job system on them.

I'm using concurrent multi-producer-multi-consumer queues.

My pool uses a worker per thread that has its own job queue. There is also a queue for initialized workers and one for pending workers. A worker first tries to dequeue a job from its own queue; if that fails, it picks a random worker and tries to dequeue a job from that worker instead. This is called a work-stealing implementation.
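A rough sketch of the per-worker queues and the stealing step described above. For brevity this guards each deque with a mutex; real job systems typically use a lock-free deque (e.g. Chase-Lev). All names here are illustrative, not from the poster's actual code.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <vector>

using Job = std::function<void()>;

// One queue per worker: the owner pops from the back (LIFO keeps caches
// warm), thieves steal from the front (FIFO reduces contention).
struct WorkerQueue {
    std::deque<Job> jobs;
    std::mutex mutex;

    void Push(Job job) {
        std::lock_guard<std::mutex> lock(mutex);
        jobs.push_back(std::move(job));
    }
    bool PopLocal(Job& out) {
        std::lock_guard<std::mutex> lock(mutex);
        if (jobs.empty()) return false;
        out = std::move(jobs.back());
        jobs.pop_back();
        return true;
    }
    bool Steal(Job& out) {
        std::lock_guard<std::mutex> lock(mutex);
        if (jobs.empty()) return false;
        out = std::move(jobs.front());
        jobs.pop_front();
        return true;
    }
};

// One scheduling step for worker `self`: own queue first, then a random victim.
inline bool GetJob(std::vector<WorkerQueue>& queues, size_t self, Job& out) {
    if (queues[self].PopLocal(out)) return true;
    static thread_local std::mt19937 rng{std::random_device{}()};
    size_t victim = rng() % queues.size();
    return victim != self && queues[victim].Steal(out);
}
```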

If a worker doesn't get jobs to do, it changes into idle mode and sleeps. I'm using a semaphore for this that locks itself and is woken up by another thread if needed. This way, my pool scales the workload down if no work is available.

As my ThreadPool is a static class, I have a unified start function that passes a pointer to the pool, treated as a job. The caller selects workers in round-robin order, wakes the chosen worker if necessary, then advances the pointer to the next worker. This way, work gets distributed equally across all threads.

Then I did some research a few years ago and found fibers. Fibers are a very useful mechanism in C++ and especially in game programming. You create threads that act as host processes, and fibers that are small pieces of stack plus their own CPU registers and data. Starting a job as a cooperatively executed fiber is similar to putting something into the thread pool. The trick comes in when you want to ‘await’ a certain condition: everything from the CPU the thread is running on is copied to a dedicated piece of memory and then swapped with another job's memory. What you achieve here is real scheduling without letting the OS do things with your threads; you really stop at a point in the code, rather than just setting a state-machine pointer, and no code generation is involved.

You can have a look at an implementation on GitHub that is based on a GDC talk from Naughty Dog. It took me some time to implement my own, and debugging it in the right places is horrible, but worth the pain in my opinion, because you are able to schedule your game's work far better than with just the OS scheduler.
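The implementations referenced above are far more involved, but the core fiber trick, suspending in the middle of a function and resuming later, can be illustrated with the POSIX ucontext API (deprecated in POSIX but still available on Linux; Windows would use CreateFiber/SwitchToFiber). Everything here is a sketch, not production scheduling code.

```cpp
#include <ucontext.h>
#include <vector>

// A fiber is a stack plus saved registers; swapcontext() suspends the
// running code mid-function and resumes somewhere else -- the "await".
static ucontext_t main_ctx, fiber_ctx;
static std::vector<int> trace;  // records execution order for illustration

static void FiberFunc() {
    trace.push_back(1);                  // fiber runs its first step
    swapcontext(&fiber_ctx, &main_ctx);  // "await": suspend, back to scheduler
    trace.push_back(3);                  // resumed exactly where it stopped
}

void RunFiberDemo() {
    static char stack[64 * 1024];        // the fiber's own small stack
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof stack;
    fiber_ctx.uc_link = &main_ctx;       // return here when the fiber ends
    makecontext(&fiber_ctx, FiberFunc, 0);

    swapcontext(&main_ctx, &fiber_ctx);  // start the fiber
    trace.push_back(2);                  // fiber is parked mid-function here
    swapcontext(&main_ctx, &fiber_ctx);  // resume it
}
```

The point is that FiberFunc pauses between its two steps without any compiler-generated state machine: the registers and stack are simply swapped out and back in.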

arnero said:
I would want to make sure that I use adjacent threads.

I have the same question - not sure if it is possible to get information on which threads run on the same core or which cores sit on the same chiplet, etc.

But in practice it might not matter much. If you have a thread pool and two large tasks, e.g. animating characters and frustum culling of objects, you would get lower level cache sharing by the usual practice.
You would divide the work into jobs of animating N characters each (the worker threads pick it up in roughly the same order, likely all processing data from similar memory regions), and after that the same for the object culling (with the same memory access pattern).

In contrast, if you used just two threads - one doing characters and the other doing objects - chances are you would get worse cache utilization.
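The batching idea above can be sketched like this. The `Character` struct and `MakeAnimationJobs` are hypothetical names for illustration: each system's work is split into jobs of N items so all workers chew through the same memory region at roughly the same time.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Character { float pose[16]; };  // hypothetical per-character data

// Split the character array into jobs of `itemsPerJob` characters each.
// Workers picking these up in order all touch nearby memory, which is the
// cache-sharing effect described above.
std::vector<std::function<void()>> MakeAnimationJobs(
    std::vector<Character>& chars, std::size_t itemsPerJob) {
    std::vector<std::function<void()>> jobs;
    for (std::size_t begin = 0; begin < chars.size(); begin += itemsPerJob) {
        std::size_t end = std::min(begin + itemsPerJob, chars.size());
        jobs.push_back([&chars, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                chars[i].pose[0] += 1.0f;  // stand-in for real animation work
        });
    }
    return jobs;
}
```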

arnero said:
Then there is the OS, which distributes threads to the processes, and high-level runtimes, which distribute these threads to threads the programmer requests. How can you check for errors?

You have to synchronize your dependencies yourself, e.g. to ensure all jobs of a certain kind are done before another kind of work is allowed to start processing.
This means there are periods where threads try to pop some work but there is nothing to do yet.
You can add a yield() to a loop that ends up constantly trying to get work but doing nothing else, so the OS gets the thread back and can do something useful with it while waiting.
But the OS does its thing transparently anyway, and personally I did not see a difference from having a yield, in either performance or shown CPU utilization.
So you have to worry only about your own stuff, not about OS.
(Using std::thread, the only OS-specific thing I have to do is setting thread priorities, e.g. if I want a responsive thread for the user interface and lower-priority background workers.)
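The sync-point pattern described above can be sketched with an atomic counter: workers decrement it as jobs finish, and the phase barrier spins with a yield until everything is done. Names are illustrative.

```cpp
#include <atomic>
#include <thread>

// Count of unfinished jobs in the current phase.
std::atomic<int> jobsRemaining{0};

// Workers call this after completing each job.
void OnJobFinished() {
    jobsRemaining.fetch_sub(1, std::memory_order_release);
}

// A thread that must wait for the phase to finish spins here; the yield
// hands the time slice back to the OS while there is nothing to do.
void WaitForPhase() {
    while (jobsRemaining.load(std::memory_order_acquire) > 0)
        std::this_thread::yield();
}
```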

arnero said:
Doom switched from the worker thread pool to a library by Intel. Now C++ is similar to LINQ (.AsParallel) in C#. Probably there is even some async/await with Task objects. How are tasks less expensive than threads?

Probably meaning Intel Threading Building Blocks library.
I have not used it, but it's probably much more advanced than what standard C++ offers. C++ does not offer a job system yet. I think there is something planned for the future, but it's not yet specified.
Async tasks, std::future and all this is probably not what you want for games, because it needs to start up threads for each task which is much too slow. (Let's say starting a thread costs 0.5 ms!)
But you can implement a job system yourself with std::thread of course, likely using atomic counters to push and pop work, and I also use atomics for sync.
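A tiny sketch of the "atomic counters to pop work" idea: a fixed array of jobs for the frame, and workers claim the next index with fetch_add, so each job is handed out exactly once without locks. `JobRange` is an illustrative name, not code from the thread.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Workers race on `next` with fetch_add; each index is returned once.
template <class Job>
struct JobRange {
    const std::vector<Job>* jobs = nullptr;
    std::atomic<std::size_t> next{0};

    const Job* Pop() {
        std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
        return i < jobs->size() ? &(*jobs)[i] : nullptr;  // nullptr: all taken
    }
};
```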

I use only a very small subset of std::thread functionality for realtime purposes, while I use much more of it for offline tools or the editor. I don't feel super experienced with MT myself yet, but there seems to be a clear difference between realtime and non-realtime design.

JoeJ said:
But you can implement a job system yourself with std::thread of course, likely using atomic counters to push and pop work, and I also use atomics for sync.

Well, `std::thread` itself is mostly a low-level component, along with the atomics, locks, condition variables, etc. So to write a job system or other low-level threaded code, it is something to consider rather than the Windows or pthreads API, certain intrinsics, etc.

Besides providing a single API, the main advantage is handling things like passing parameters, return types, and capturing exceptions, while the platform API might basically just give you a “void*”-sized parameter and an integer exit status to build things on top of.

In theory, the C++ async task/job system is `std::async`. An implementation may implement it with a thread pool to avoid the thread starting cost (I believe most do), and may restrict it to a fairly optimal max thread limit (don't recall noticing any do this, probably because they assume your threads will block and stuff). But you have very little control over this.

And when making your own thread pool, you can still make use of `std::future` etc. to get basically the same API, but with much more control. Although quite possibly you can gain a bit more performance without any of those C++ library things.
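A minimal `std::async`/`std::future` sketch of the kind of task API being discussed: split a sum across a task and the calling thread, then join via the future. Whether the implementation reuses pooled threads is, as noted above, up to the standard library.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Sum the first half on a task, the second half on this thread.
long long ParallelSum(const std::vector<int>& data) {
    std::size_t half = data.size() / 2;
    auto front = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), data.begin() + half, 0LL);
    });
    // Overlap: this thread works while the task runs.
    long long back = std::accumulate(data.begin() + half, data.end(), 0LL);
    return front.get() + back;  // get() blocks until the task finishes
}
```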

JoeJ said:
I have the same question - not sure if it is possible to get information which threads run on the same core or which cores run an the same chiplet etc.

There are APIs to get the NUMA nodes, physical core count, logical core count, etc. The only multi-NUMA system you are likely to come across is 1st and 2nd Gen Threadripper. For anything more detailed, I think you would have to code for the specific CPU series if you were interested, unless there are APIs I missed detailing, say, CCX layout, and it is unclear whether those would be forward-compatible with whatever the next AMD/Intel/etc. design is.
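The one portable piece of those queries is the logical core count. A hedged sketch (the platform APIs mentioned in the comment are for orientation only; the function name is mine):

```cpp
#include <thread>

// std::thread::hardware_concurrency() reports the logical core count,
// or 0 when it cannot be determined. Physical core counts and NUMA/CCX
// layout need platform APIs (e.g. GetLogicalProcessorInformationEx on
// Windows, /sys/devices/system/cpu or libnuma on Linux).
unsigned LogicalCores() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 1;  // fall back to 1 when the count is unknown
}
```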

You can set per-thread affinity, but at this point you are basically trying to outthink the system scheduler. It would certainly need a lot of testing, because you have to consider what the hundreds or thousands of other processes/threads on the system want to do. On something like a console, where you can code to a specific design, this could make a lot more sense.

e.g. if you restrict one of your threads to a specific logical core, and the CPU is almost idle and your thread wants to run, but Chrome, or Steam, or whatever is currently on that core, well, now you don't run until its time slice is done, and you are almost certainly a lot worse off.

Maybe the OS is clever enough to go “oh that thread wants to run 99% of the time and it's only on core 2, it's not running right now but still maybe I better put this other thing not on core 2”, but honestly that would surprise me.

Now maybe instead you try a middle way and lock all your threads to one logical core per physical core. This would still need a lot of testing, though, and you still can't stop other programs from running on the same physical core as you with SMT (at least without some nasty hacking to change other programs' affinity dynamically).

JoeJ said:

For games it's common to create a constant number of worker threads (usually one per core) which keep running all the time and pick up work as it becomes available. (Creating a thread is too expensive to do for each task.)

Searching for ‘job system’ and ‘thread pool’ should bring you to game related tutorials or implementations.

Thank you so much!


This topic is closed to new replies.
