
Handling "float" in generic memory (blob)

Started by
20 comments, last by Juliean 3 years, 1 month ago

Juliean said:
And I could also just always set the interpreter.cpp to compile as “release”.

This is what I would suggest doing. My impression is that it is not unusual for large C++ codebases to have optimizations turned on across all files in all configurations devs use regularly, and optimizations are disabled at the per-file level when necessary.


Juliean said:
Puh, I'm not an expert on the C++ standard, but from the wording I've read (can't find it quickly right now) I always assumed that even this was illegal:

I'd have to look it up. I think the cast is legal as long as pMemory “is an int”, that is, in both C and C++ these are allowed:

// 1. Round trip some type through signed/unsigned char* or void* and back again
int_ptr2 = (int*)(char*)int_ptr;
foo->user_ptr = (void*)&my_int; // or often a larger structure
int fd = *(int*)foo->user_ptr; // generally in a callback or such later

// 2. Take part of a char* buffer and treat it as some *single* type (basically a memory allocator)
char *memory = ...;
// can be any subset of memory, possibly via void*, but we must ensure correct alignment for the T* being made!
// malloc and friends are, I believe, specified to return a pointer aligned for all primitive types, but on some platforms that may still not be enough for some vector types outside the C/C++ standards
int *int_ptr = (int*)(memory + 20); 
*int_ptr = 20;
int y = *int_ptr + 5;
// In C++ you have placement new plus explicit destructor calls, required for anything with a constructor or destructor!
// Still have to ensure alignment!
std::string *str = new(memory + 24) std::string("Hello world!");
str->~string(); // when done with it, does not deallocate "memory" but that block could now be used for another object safely

The bit I am not sure about in 2. is whether this is legal; I would have to check:

char *memory = ...;
int *int_ptr = (int*)(memory + 20);
*int_ptr = 5; // "allocated" memory
// later
int *int_ptr2 = (int*)(memory + 20); // Note I cast again, instead of reusing the same int_ptr as before
int x = *int_ptr2; // The compiler can see the cast, might consider this memory as not holding an integer, and treat this as UB

But this is also not exactly the same as what memcpy does, since it never casts the pointer to an incompatible type (you are only doing T* → char/void*, never char/void* → T*); it simply copies bytes.

void memcpy(void *dst, const void *src, size_t len) // I believe char* is equally valid
{
    char *dst2 = (char*)dst; // casting any pointer to char* is OK
    const char *src2 = (const char *)src;
    for (size_t i = 0; i < len; ++i)
        dst2[i] = src2[i]; // copying the char values is always legal
}

Also, if I recall, copying the bytes from, say, an integer to a float, or partial copies between different sizes, is “implementation defined” rather than “undefined”. This is because the standard does not promise what the byte-level format of these types is, and so it can't tell you what specific bit/byte values you will get.

I think the newer standard (C++20) promises two's-complement signed integers, so copying between signed and unsigned integers of the same size is defined. But the floating-point formats are still implementation-specific, as are the sizes of char/short/int/long/float/double/etc., within certain constraints.
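As a small sketch of that defined case (the function name is hypothetical): byte-copying a signed 32-bit integer into a same-size unsigned one gives a fully determined value once two's complement is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Sketch (name hypothetical): byte-copy a signed 32-bit integer into
// an unsigned one. Under two's complement (mandated by C++20, and
// universal in practice before it) the resulting value is fully
// defined; e.g. -1 becomes 0xFFFFFFFF.
std::uint32_t to_unsigned_bits(std::int32_t v)
{
    std::uint32_t u;
    std::memcpy(&u, &v, sizeof u); // copies the object representation
    return u;
}
```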

You will also encounter a similar thing with binary file or network IO.

@Oberon_Command Hmm, if I have time I will have to see if I can find that in the actual standards.

Of course, as mentioned, in C++ you have placement new, which should be used. Since C doesn't have this, I believe it is OK to skip placement new specifically for primitive types, which have no constructor or destructor and can be initialised by assignment.

And it does say “operations that begin the lifetime of an array of type char, unsigned char, or std::byte (since C++17), in which case such objects are created in the array”. So my understanding here is that m_stack containing, say, a char stack[16*1024], or a vector of char, or such, meets that?

And in practice, code uses many APIs other than std::malloc or the other functions specifically listed, either its own allocators or operating-system-provided ones, so I am not sure if cppreference is just providing a non-exhaustive list of examples. But at the very least it means every compiler I can think of is going to allow it, because Windows has VirtualAlloc etc., and Linux/POSIX has posix_memalign etc.
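Under that reading, the m_stack case could look something like this minimal sketch (ByteStack/push/pop are hypothetical names, not from the thread). Values only ever move in and out via memcpy, so no T* into the buffer is ever formed and the aliasing question never arises:

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Minimal sketch of a byte-buffer value stack (all names hypothetical).
// Values are moved in and out with memcpy only, so no T* into the
// buffer is ever formed; alignment of individual slots is moot.
struct ByteStack
{
    char data[16 * 1024];
    std::size_t size = 0;

    template <typename T>
    void push(const T &value)
    {
        static_assert(std::is_trivially_copyable<T>::value, "memcpy requires this");
        std::memcpy(data + size, &value, sizeof(T));
        size += sizeof(T);
    }

    template <typename T>
    T pop()
    {
        static_assert(std::is_trivially_copyable<T>::value, "memcpy requires this");
        size -= sizeof(T);
        T value;
        std::memcpy(&value, data + size, sizeof(T));
        return value;
    }
};
```

Note this relies on the caller popping with the same type it pushed; an interpreter's bytecode would normally guarantee that.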

SyncViews said:
I'd have to look it up. I think the cast is legal as long as pMemory “is an int”, that is, in both C and C++ these are allowed

Ah, that might be. But in my case this would still not be allowed, since the data would actually be “float”. So if the compiler knew that I then tried to reinterpret the “float” as “int” to copy it, it shouldn't be allowed.
More so, the compiler simply has no way of knowing what the internal data actually is in my code (since I'm not passing it a data source that it can trace through the last few lines of code or calls, but a pointer to a generic memory pool that is written to and read from different sources). So the compiler can either choose to ignore the fact that it knows nothing about what I'm trying to reinterpret_cast (which is what MSVC does) or treat it as UB and possibly not read/write anything at all (which is what I believe other compilers may do).

Oberon_Command said:
This is what I would suggest doing. My impression is that it is not unusual for large C++ codebases to have optimizations turned on across all files in all configurations devs use regularly, and optimizations are disabled at the per-file level when necessary.

I just need to do some real-world testing before I can fully commit to that. I already tested that idea on its own and it worked. However, now that I have implemented the ability to bind arbitrary C++ functions, this might still be an issue, since those functions essentially access the same stack operations Pop/Push, but compiled via template in the cpp file where they are registered. I guess that the function call itself is probably way more expensive than the added overhead of std::bit_cast or something for retrieving an int, but I still want to measure it in the actual game (which will still take me a while to get running).

Juliean said:
Ah, that might be. But in my case this would still not be allowed, since the data would actually be “float”. So if the compiler knew that I then tried to reinterpret the “float” as “int” to copy it, it shouldn't be allowed.

EDIT:

Well, that depends on my question regarding case 2. It might be allowed, I am not sure. You initialise the memory, but then let the pointer go out of scope, and later you cast again, which is the bit I think might break it.

So given the pointer goes out of scope, the compiler might say the assignment “has no effect” and optimise it out.
And in the read case the compiler might say “you are reading uninitialised memory” and so not do the read.

So the safe way I see is something like memcpy, operating on the chars, because that is defined as I recall; you could even just code it like this. I believe it is only the moment you try to go the extra step and cast to a larger type that there are potential issues (even if the address is aligned; if unaligned, you certainly could have problems).

static_assert(sizeof(int) == 4 && sizeof(int) == sizeof(float));
const char *src = stack + offset;
char *dst = stack + stack_size;
stack_size += 4;
dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];

So in debug I guess this is stuck being a few times slower (four separate moves vs. one, with the same overhead to compute the array offsets), but still faster than a memcpy call (which adds function-call overhead, plus the generic memcpy implementation will have a loop of some form over the size). In release it would probably be optimised to be the same as memcpy.

EDIT 2: Actually, I'm not sure the optimisation in release would be as good. The compiler probably has to assume that dst and src overlap, which gives you memmove, while the internal implementation of memcpy can assume they do not. Not sure if you can do any of those internal tricks within standard C/C++.
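For comparison, the fixed-size memcpy form of the same copy is what compilers typically collapse to a single load/store pair once the size is a compile-time constant. A sketch (function name hypothetical; the names mirror the byte-wise snippet above, and note memcpy, unlike memmove, requires the two 4-byte regions not to overlap):

```cpp
#include <cstddef>
#include <cstring>

// Sketch mirroring the byte-wise copy above: same effect, but as one
// fixed-size memcpy. The source and destination ranges must not
// overlap (handling overlap is memmove's job, not memcpy's).
void push_copy4(char *stack, std::size_t &stack_size, std::size_t offset)
{
    std::memcpy(stack + stack_size, stack + offset, 4);
    stack_size += 4;
}
```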

SyncViews said:
So in debug I guess this is stuck being a few times slower (four separate moves vs. one, with the same overhead to compute the array offsets), but still faster than a memcpy call (which adds function-call overhead, plus the generic memcpy implementation will have a loop of some form over the size). In release it would probably be optimised to be the same as memcpy. EDIT 2: Actually, I'm not sure the optimisation in release would be as good. It probably has to assume that dst and src overlap, which gives you memmove, while the internal implementation of memcpy can assume they do not. Not sure if you can do any of those internal tricks within standard C/C++.

Yeah, I just checked out of interest. The results are pretty interesting indeed:

https://godbolt.org/z/5cYoGjTeh

This is the code without any optimizations! Interestingly, the memcpy is pretty much compiled to the same operation that would occur if you did x = y, while your version, well :D That's why they say you should always measure, don't they?

I mean, that's still pretty insightful. I wasn't aware that memcpy is that good even in debug. I knew memcpy with a fixed size is heavily optimized (as it's an intrinsic instead of a regular function), but that means the whole discussion is solved anyway (I'll have to check MSVC as well, but yeah). The only thing that's still pretty shitty is that std::bit_cast has so much more overhead then (maybe it's an MSVC thing as well). Sure, memcpy is then the right choice, but it's still a lot more to type than std::bit_cast would be (and what's with the zero-overhead thing we've got going in C++, huh?)
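For what it's worth, a memcpy-based stand-in for std::bit_cast is short to write once; this is essentially what bit_cast standardizes, minus the constexpr support. A sketch, with a hypothetical name:

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical helper: reinterpret the bytes of one trivially
// copyable type as another of the same size via memcpy (what
// std::bit_cast does, without the constexpr guarantee).
template <typename To, typename From>
To bit_cast_via_memcpy(const From &src)
{
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<From>::value &&
                  std::is_trivially_copyable<To>::value,
                  "memcpy requires trivially copyable types");
    To dst;
    std::memcpy(&dst, &src, sizeof(To));
    return dst;
}
```

Whether this actually beats std::bit_cast in a given debug build is exactly the kind of thing to measure, as above.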

When I looked, on MSVC debug memcpy was a call into the generic-size version, but maybe there is a flag to get it to use intrinsic optimisations while still keeping most debug functionality. With optimisations I definitely expect it to be equal or likely better.

Interesting that GCC does that without enabling optimisations.

I suppose you could macro it, since using a template/wrapper just gives you the debug overhead back.
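A macro version of the fixed-size read might look like this sketch (name hypothetical); because it expands inline, there is no wrapper call left to pay for in an unoptimized build:

```cpp
#include <cstring>

// Hypothetical macro: read a value of type T out of a byte buffer via
// a fixed-size memcpy. Expands inline at the use site, so debug builds
// pay no function-call overhead for a wrapper.
#define READ_VALUE(T, ptr, out) std::memcpy(&(out), (ptr), sizeof(T))
```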

EDIT: And just to confirm my original thought on memcpy, this looks about as good on GCC and MSVC release build as it is going to get.
The debug builds do mess around with the stack though. https://godbolt.org/z/Mve4x36x4

SyncViews said:
When I looked, on MSVC debug memcpy was a call into the generic-size version, but maybe there is a flag to get it to use intrinsic optimisations while still keeping most debug functionality. With optimisations I definitely expect it to be better.

Perhaps I saw the same thing. I know that you can set whether or not to use the “intrinsic” version of certain functions (https://docs.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-160), but thought that it would only affect optimized builds.

SyncViews said:
Interesting that GCC does that without enabling optimisations.

I mean, it kind of also makes sense in hindsight, at least when you consider that memcpy is treated as an intrinsic, and how intrinsics are handled. Or at least how I personally handle intrinsics in my own compiler. Without going into much detail: I also have the concept of functions which appear normal (as a node in my visual language) but are not actually functions but code generators (not to be confused with macros). And there I would do the same thing: instead of emitting a “call” to memcpy and deferring to the optimizer to remove it, you evaluate the parameters, see that the size is fixed, and just produce a “copy 4 bytes” ASM right then and there.
At least that's my educated guess based on what I'm doing myself now (I of course don't have a memcpy specifically, but you do see this at other points, like for-loops, where I pretty much just copied the algorithm from what GCC seems to do in debug builds).

SyncViews said:
I suppose you could macro it, since using a template/wrapper just gives you the debug overhead back.

Yeah, that's what I would probably do. I already started putting some things in the interpreter into macros, which I don't like from a code-cleanliness point of view. But my old system leaned way too far toward keeping code clean vs. performance, and I'm paying the price now. I mean, perhaps if the 32x speedup that I saw keeps up for the final game (not 100% sure if it's going to be more or less), I might be able to refactor things back a bit.

“memcpy is treated as an intrinsic”, but isn't that entirely a concept that GCC made up as an optimisation? And it only does it for known small sizes; it calls a memcpy function otherwise. I didn't think the C standard gave memcpy any particular special treatment compared to, say, memset.

void zero_int(char *data)
{
    memset(data, 0, sizeof(int)); // optimised in -O1 and above, function call otherwise
}

EDIT:

Juliean said:
But my old system leaned way too far toward keeping code clean vs. performance, and I'm paying the price now. I mean, perhaps if the 32x speedup that I saw keeps up for the final game (not 100% sure if it's going to be more or less), I might be able to refactor things back a bit

Well, all of this was about debug. In release, if it doesn't inline smaller templates and other wrapper functions, something is up; normally I manage to get it to do so, and once inlined it ends up almost the same as if the logic were there directly.

So using macros should just be to potentially make a debug build more usable, if you don't want to mix flags on compilation units or need some header-only stuff to be fast.

SyncViews said:
“memcpy is treated as an intrinsic”, but isn't that entirely a concept that GCC made up as an optimisation? And it only does it for known small sizes; it calls a memcpy function otherwise. I didn't think the C standard gave memcpy any particular special treatment compared to, say, memset.

Might very well be so; I might be totally wrong here. From my own experience: since I work in a visual language where everything is a node, I personally needed the concept of an “intrinsic” node for things like ifs and loops or int-add anyway. So it would feel natural in that system to treat memcpy just like another node for generating code directly based on a condition. In that sense it is certainly a sort of “optimization”, but one that is applied during initial code generation. Maybe it doesn't make full sense in a text-based language, and it's just what I came up with because it is more natural when you are dealing just with nodes and not a grammar.
I did see that GCC also did the same thing for loops/ifs, where unreachable paths were discarded even without optimization, but perhaps that's also just local to that compiler. I did check and see that, indeed, MSVC does call “memcpy” in debug builds. But I didn't get the pragma intrinsic to run in godbolt, so I have to check that in my own Visual Studio.

EDIT: pragma intrinsic doesn't seem to change anything in Visual Studio. Seems that I'm going to stick with reinterpret_cast for that platform, if it insists. Doing a macro seems to be the most sane thing here after all: then I can choose the safe, defined behaviour on platforms where it matters (and where things like memcpy might be zero-overhead even in debug), and everywhere else I can stick to dirtier tricks. It would even allow me to benchmark different things later to see if it matters, without changing all the code.
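The per-platform switch described here could be sketched roughly like this (macro name hypothetical; the MSVC branch is the "dirty" path, relying on that compiler tolerating the cast even though it is formally UB under the aliasing rules):

```cpp
#include <cstring>

// Hypothetical per-platform read macro: the safe, defined memcpy form
// on compilers where it is cheap even unoptimized, and the
// reinterpret_cast trick on MSVC, which tolerates it in practice
// (formally UB under the strict-aliasing rules).
#if defined(_MSC_VER)
#define STACK_READ_INT(ptr, out) ((out) = *reinterpret_cast<const int *>(ptr))
#else
#define STACK_READ_INT(ptr, out) std::memcpy(&(out), (ptr), sizeof(int))
#endif
```

Swapping the definitions per build would also make it easy to benchmark one against the other without touching call sites.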

This topic is closed to new replies.
