Could anyone help with UTF-8 line breaking in C?

8 comments, last by Nagle 3 years, 10 months ago

C, not C++. Just want to say that upfront.

So, we maintain an old libre open source game called Project: Starfighter, originally developed by Parallel Realities in 2003. As you might imagine, it didn't originally support Unicode; we added Unicode support (or, more specifically, UTF-8 support) later.

Unfortunately, it seems one aspect of the Unicode system we implemented is broken: line wrapping. We thought we had this solved with the Pango library (the only one we could find), which reports where line breaks can be made… but it reports at the character level, not at the byte level. So if a character (like, say, あ) is encoded as multiple bytes (three in this case), it reports all three as breakable because it's actually reporting for the character and not the specific byte.

Full detail here: https://github.com/pr-starfighter/starfighter/issues/9

So my question is, can anyone offer insight on what should be done here? For reference, this is the function that's supposed to handle this:

int gfx_renderUnicodeBase(const char *in, int x, int y, int real_x, int fontColor, int wrap, SDL_Surface *dest)
{
	SDL_Surface *textSurf;
	SDL_Color color;
	int w, h;
	int avail_w;
	int changed;
	int breakPoints[STRMAX];
	int nBreakPoints;
	char testStr[STRMAX];
	char remainingStr[STRMAX];
	PangoLogAttr logAttrs[STRMAX];
	int nLogAttrs;
	int i;
	SDL_Rect area;

	if (strcmp(in, "") == 0)
		return y;

	avail_w = dest->w - real_x;

	switch (fontColor)
	{
		case FONT_WHITE:
			color.r = 255;
			color.g = 255;
			color.b = 255;
			break;
		case FONT_RED:
			color.r = 255;
			color.g = 0;
			color.b = 0;
			break;
		case FONT_YELLOW:
			color.r = 255;
			color.g = 255;
			color.b = 0;
			break;
		case FONT_GREEN:
			color.r = 0;
			color.g = 255;
			color.b = 0;
			break;
		case FONT_CYAN:
			color.r = 0;
			color.g = 255;
			color.b = 255;
			break;
		case FONT_OUTLINE:
			color.r = 0;
			color.g = 0;
			color.b = 10;
			break;
		default:
			color.r = 255;
			color.g = 255;
			color.b = 255;
	}

	if (gfx_unicodeFont != NULL)
	{
		strcpy(remainingStr, in);
		if (TTF_SizeUTF8(gfx_unicodeFont, remainingStr, &w, &h) < 0)
		{
			engine_error(TTF_GetError());
		}

		changed = wrap;
		while (changed && (w > avail_w))
		{
			nLogAttrs = strlen(remainingStr) + 1;
			pango_get_log_attrs(remainingStr, strlen(remainingStr), -1, NULL, logAttrs, nLogAttrs);

			nBreakPoints = 0;
			for (i = 0; i < nLogAttrs; i++)
			{
				if (logAttrs[i].is_line_break)
				{
					breakPoints[nBreakPoints] = i;
					nBreakPoints++;
				}
			}

			changed = 0;
			for (i = nBreakPoints - 1; i >= 0; i--)
			{
				strncpy(testStr, remainingStr, breakPoints[i]);
				testStr[breakPoints[i]] = '\0';
				if (TTF_SizeUTF8(gfx_unicodeFont, testStr, &w, &h) < 0)
				{
					engine_error(TTF_GetError());
				}
				if (w <= avail_w)
				{
					textSurf = TTF_RenderUTF8_Blended(gfx_unicodeFont, testStr, color);
					if (textSurf == NULL)
					{
						printf("While rendering testStr \"%s\" as unicode...\n", testStr);
						engine_error("Attempted to render UTF8, got null surface!");
					}

					area.x = x;
					area.y = y;
					area.w = textSurf->w;
					area.h = textSurf->h;
					if (SDL_BlitSurface(textSurf, NULL, dest, &area) < 0)
					{
						printf("BlitSurface error: %s\n", SDL_GetError());
						engine_showError(2, "");
					}
					SDL_FreeSurface(textSurf);
					textSurf = NULL;
					y += TTF_FontHeight(gfx_unicodeFont) + 1;

					memmove(remainingStr, remainingStr + breakPoints[i],
						(strlen(remainingStr) - breakPoints[i]) + 1);
					changed = 1;
					break;
				}
			}

			if (TTF_SizeUTF8(gfx_unicodeFont, remainingStr, &w, &h) < 0)
			{
				engine_error(TTF_GetError());
			}
		}
		textSurf = TTF_RenderUTF8_Blended(gfx_unicodeFont, remainingStr, color);
		if (textSurf == NULL)
		{
			printf("While rendering remainingStr \"%s\" as unicode...\n", remainingStr);
			engine_error("Attempted to render UTF8, got null surface!");
		}

		area.x = x;
		area.y = y;
		area.w = textSurf->w;
		area.h = textSurf->h;
		if (SDL_BlitSurface(textSurf, NULL, dest, &area) < 0)
		{
			printf("BlitSurface error: %s\n", SDL_GetError());
			engine_showError(2, "");
		}
		SDL_FreeSurface(textSurf);
		textSurf = NULL;
		y += TTF_FontHeight(gfx_unicodeFont) + 1;
	}
	else
	{
		engine_warn("gfx_unicodeFont is NULL!");
	}

	return y;
}

It's actually straightforward to check if a byte in a UTF-8 string is the start of a character or not. Here's some code:

int IsStartOfUTF8Character(unsigned char c)
{
    // Top bit not set - single byte ascii character
    if ((c & 0x80) == 0) return TRUE;

    // Top two bits set - start of multi byte character
    if (c & (0x80 | 0x40)) return TRUE;

    // Top bit is set, but second bit isn't set - we're somewhere in the middle of a character
    return FALSE;
}

If you want to understand how that works, take a look at https://www.instructables.com/id/Programming--how-to-detect-and-read-UTF-8-charact/

Note that it doesn't handle things like combining diacritics, but hopefully Pango handles those nicely for you.

OK, so, I'm confused: your function doesn't seem to work for me. Just taking an example I know, the hiragana さ is composed of the bytes 0xE3, 0x81, and 0x95 (at least according to Python), and all three cause that function to return true from the second test. This is all based on manually crunching the numbers in Python, but the tests I tried in the C context I wanted to use this in seemed to give me the same result.

EDIT: Ah! Figured that out. Your second test is just malformed; it should be:

int IsStartOfUTF8Character(unsigned char c)
{
    // Top bit not set - single byte ascii character
    if ((c & 0x80) == 0) return TRUE;

    // Top two bits set - start of multi byte character
    if ((c & 0x80) && (c & 0x40)) return TRUE;

    // Top bit is set, but second bit isn't set - we're somewhere in the middle of a character
    return FALSE;
}

The comments saved me there. Anyway, thanks! With that correction in methodology it seems to work.
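
To make the arithmetic above concrete, here is a small sketch that runs both versions of the second test over the three bytes of さ. The function names are made up for this illustration and are not part of the thread's code.

```c
#include <stddef.h>
#include <stdio.h>

/* Second test as first posted: true when EITHER of the top two bits is set. */
int is_start_either_bit(unsigned char c)
{
    if ((c & 0x80) == 0) return 1;
    if (c & (0x80 | 0x40)) return 1;
    return 0;
}

/* Second test as corrected: true only when BOTH of the top two bits are set. */
int is_start_both_bits(unsigned char c)
{
    if ((c & 0x80) == 0) return 1;
    if ((c & 0x80) && (c & 0x40)) return 1;
    return 0;
}

/* Prints one line per byte with the result of each version of the test. */
void compare_tests(const unsigned char *bytes, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        printf("0x%02X: either-bit=%d both-bits=%d\n",
               (unsigned)bytes[i],
               is_start_either_bit(bytes[i]),
               is_start_both_bits(bytes[i]));
    }
}
```

Calling compare_tests on the bytes 0xE3, 0x81, 0x95 should show the first version accepting all three, while the corrected version accepts only 0xE3, because 0x81 and 0x95 both have the bit pattern 10xxxxxx.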

My apologies, I guess I typed that code a bit too quickly.

What I should have written is this - it does the same thing as your fixed version, but possibly slightly more efficiently:

int IsStartOfUTF8Character(unsigned char c)
{
    // Top bit not set - single byte ascii character
    if ((c & 0x80) == 0) return TRUE;

    // Top two bits set - start of multi byte character
    if ((c & (0x80 | 0x40)) == (0x80 | 0x40)) return TRUE;

    // Top bit is set, but second bit isn't set - we're somewhere in the middle of a character
    return FALSE;
}
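
As a rough sketch of how this test might be used, the two checks can be folded into a single mask comparison (a byte starts a character unless it looks like 10xxxxxx), and counting the start bytes gives the number of characters in a string. The names is_utf8_start and utf8_count_chars are hypothetical, and the count is in code points, so combining diacritics are still counted separately.

```c
#include <stddef.h>

/* Nonzero unless c is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_start(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/*
 * Hypothetical helper: counts the characters (code points) in a
 * NUL-terminated UTF-8 string by counting its start bytes.
 * Assumes the input is valid UTF-8; no validation is done.
 */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
    {
        if (is_utf8_start((unsigned char)*s))
            count++;
    }
    return count;
}
```

For example, a string holding "a", さ, and "b" is five bytes long but counts as three characters.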

Noted, thanks!

So it sounds as if the Pango library is not handling UTF-8 properly, is that right?

Wouldn't it be more suitable to read the characters as UTF-8/int32 data and decode them properly by default? Then you could still decide on a per-character basis whether the full UTF-8 character is breakable.

So it sounds as if the Pango library is not handling UTF-8 properly, is that right?

No, it's handling it right. It's just reporting at the character level rather than at the byte level.

It's poorly documented, but it's not handling it improperly.

Wouldn't it be more suitable to read the characters as UTF-8/int32 data and decode them properly by default? Then you could still decide on a per-character basis whether the full UTF-8 character is breakable.

If you're talking about doing the kind of thing C++ code seems to do (where it gets converted into an array of 32-bit integers or whatever), then in this case, no. Remember that the code we're talking about is in C, and it ultimately passes on UTF-8 data to TTF_RenderUTF8_Blended, which expects UTF-8 encoded data in a simple char array (and does so quickly).

Actually, the reason we chose to support exclusively UTF-8 is because it can be represented in a simple char array. Converting to some other form would defeat the purpose of that by adding needless complexity. (It would be different if we wanted to support encoding schemes other than UTF-8, but for the purposes of Project: Starfighter UTF-8 alone is sufficient. We're also not aware of any library in C that handles all Unicode formats; Pango was all we could find, although that might partly be because search queries always get muddied with results for C++.)
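
Staying with a plain char array does not rule out using per-character break information. As a hedged sketch: if the break flags are indexed by character position (as I understand Pango's log attributes to be, with one extra position for the end of the string), a single pass over the bytes can translate them into byte offsets. The function name and the is_break array are hypothetical; is_break[k] stands in for logAttrs[k].is_line_break.

```c
#include <stddef.h>

/*
 * Hypothetical helper: is_break[k] is nonzero if a line break is allowed
 * in front of character k, where k equal to the character count means
 * the end of the string. Writes the byte offset of each allowed break
 * into out (at most max_out entries) and returns how many were written.
 * Assumes s is valid, NUL-terminated UTF-8.
 */
size_t break_byte_offsets(const char *s, const unsigned char *is_break,
                          size_t n_attrs, size_t *out, size_t max_out)
{
    size_t offset = 0; /* byte position in s */
    size_t k = 0;      /* character position in s */
    size_t n_out = 0;

    for (;;)
    {
        unsigned char c = (unsigned char)s[offset];

        /* Every non-continuation byte, including the final NUL,
           begins a new character position. */
        if ((c & 0xC0) != 0x80)
        {
            if (k < n_attrs && is_break[k] && n_out < max_out)
                out[n_out++] = offset;
            k++;
        }
        if (c == '\0')
            break;
        offset++;
    }
    return n_out;
}
```

The resulting offsets can be used directly for the strncpy and memmove calls, since they always land on the first byte of a character.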

Pango appears to use functions such as `g_utf8_next_char` and `g_utf8_strlen` (via `pango_utf8_strlen`) for basically treating inputs like UTF-32 codepoint sequences.

diligentcircle said:
If you're talking about doing the kind of thing C++ code seems to do (where it gets converted into an array of 32-bit integers or whatever), then in this case, no. Remember that the code we're talking about is in C, and it ultimately passes on UTF-8 data to TTF_RenderUTF8_Blended, which expects UTF-8 encoded data in a simple char array (and does so quickly).

Interestingly, `TTF_SizeUTF8(_Internal)` converts each element to a codepoint (UTF-32), and TTF_RenderUTF8_Blended calls that and then converts each element to a codepoint again in its own loop. Also, both functions call SDL_strlen right at the start, so if they just took a length, you could have skipped the strncpy stuff (and I hope STRMAX is checked by every caller).

I guess it saves a tiny bit of memory, but it's interesting that the first thing many libraries do is get the codepoints plus some form of length. I wonder if fitting more into the cache etc. pays off? Even if not, I suppose it is a fairly small cost compared to actually drawing a glyph/character.

Not sure how SDL_ttf gets on with combining characters; that was something on my list to look at while investigating Unicode with FreeType. My basic understanding is that this is handled by another part of Pango or by HarfBuzz.
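
For reference, converting one UTF-8 sequence to a codepoint is a short loop. The following is a minimal sketch of the general technique, not SDL_ttf's or Pango's actual code; utf8_decode is a hypothetical name, and the function does not reject overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decodes the UTF-8 sequence starting at s into
 * *out and returns the number of bytes consumed, or 0 if the sequence
 * starts with a continuation byte, has an invalid lead byte, or is
 * cut short. s must be NUL-terminated.
 */
size_t utf8_decode(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;
    uint32_t cp;

    /* The lead byte gives the sequence length and the first payload bits. */
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return 0;

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    for (i = 1; i < len; i++)
    {
        if ((p[i] & 0xC0) != 0x80)
            return 0; /* also catches a NUL that truncates the sequence */
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

With the bytes 0xE3, 0x81, 0x95 from earlier in the thread, this consumes three bytes and yields U+3055 (さ).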

For what it's worth, I once wrote grapheme-aware word wrap for UTF8.

In Rust.

This may or may not be helpful.

https://github.com/John-Nagle/rust-rssclient/blob/master/src/wordwrap.rs
