How would you create a string of all UTF-8 characters?
There are many ways to represent the more than one million Unicode characters that UTF-8 can encode. Take the Latin capital "A" with macron (Ā). This is Unicode code point U+0100, hex bytes 0xc4 0x80, decimal bytes 196 128, and binary 11000100 10000000.
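As a quick check of those equivalences, here is a minimal sketch that prints the same two bytes in hex, decimal, and binary. The variable names are illustrative only.

```php
<?php
// U+0100 encoded as UTF-8 is the two bytes 0xC4 0x80.
$char = "\xc4\x80";

// Hex view of the raw bytes.
$hex = bin2hex($char);

// Decimal and binary views of each individual byte.
$decimal = array_map('ord', str_split($char));
$binary  = array_map(function ($b) { return sprintf('%08b', $b); }, $decimal);

echo $hex, "\n";                   // c480
echo implode(' ', $decimal), "\n"; // 196 128
echo implode(' ', $binary), "\n";  // 11000100 10000000
```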
I would like to create a collection of the first 65,535 UTF-8 characters for use in testing applications. These are all Unicode characters up to code point U+FFFF (at most three bytes each in UTF-8).
Is it possible to do something like a for($x=0) loop and then convert the resulting decimal to another base (like hex), which would allow the creation of the matching Unicode character?
I can create the value Ā using something like this:
$char = "\xc4\x80"; // or $char = chr(196).chr(128);
However, I am not sure how to turn this into an automated process.
// fail!
$char = "\x" . dechex($a) . "\x" . dechex($b);
Answer by Your Common Sense for How would you create a string of all UTF-8 characters?
:) Of course the last one wouldn't work. A \x escape sequence is only interpreted inside a double-quoted string literal; it cannot be assembled by concatenation at run time.
What's wrong with $char = chr(196).chr(128);? With chr($a).chr($b), I mean.
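For instance, the chr($a).chr($b) idea could be automated for the two-byte range along these lines. This is only a sketch and assumes you want U+0080 through U+07FF, where the lead byte runs from 0xC2 to 0xDF and the continuation byte from 0x80 to 0xBF. Three-byte characters would need a third loop and extra care around the surrogate range.

```php
<?php
// Build every two-byte UTF-8 character (U+0080..U+07FF) from raw byte values.
$twoByte = '';
for ($a = 0xC2; $a <= 0xDF; $a++) {       // valid two-byte lead bytes
    for ($b = 0x80; $b <= 0xBF; $b++) {   // valid continuation bytes
        $twoByte .= chr($a) . chr($b);
    }
}

// 30 lead bytes * 64 continuation bytes = 1920 characters, 2 bytes each.
echo strlen($twoByte), "\n";
```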
Answer by Tim Pietzcker for How would you create a string of all UTF-8 characters?
I'm not sure you can do this programmatically, mostly because there is a difference between a Unicode code point and a character. See http://www.unicode.org/standard/where for a few examples of characters that are represented by a combination of code points.
Some code points make no sense on their own and can only be used in conjunction with another character (think accents). See http://www.unicode.org/charts/charindex.html for a list of code points, and look at the section with all the "combining" code points.
Also, for use in testing applications, you'd need something else besides a list of possible UTF-8 code points, namely several invalid/malformed UTF-8 sequences that your app needs to be able to recover gracefully from.
For this, take a look at Markus Kuhn's Unicode stress test.
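To give a rough idea of what such malformed input looks like, here is a small sketch with a few hand-picked byte sequences. The sample labels and the $samples array are made up for illustration. Validation uses preg_match with the /u modifier, which is one possible way to test for well-formed UTF-8 and not the only one.

```php
<?php
// Hypothetical test inputs: one valid string and a few malformed byte sequences.
$samples = array(
    'valid two-byte'     => "\xc4\x80",     // U+0100
    'lone continuation'  => "\x80",         // continuation byte with no lead byte
    'truncated sequence' => "\xc4",         // lead byte with nothing after it
    'overlong slash'     => "\xc0\xaf",     // "/" encoded in two bytes
    'encoded surrogate'  => "\xed\xa0\x80", // U+D800 forced into three bytes
);

$results = array();
foreach ($samples as $label => $bytes) {
    // With /u, preg_match() fails (returns false) on invalid UTF-8 subjects.
    $results[$label] = (preg_match('//u', $bytes) === 1);
    echo $label, ': ', $results[$label] ? 'valid' : 'malformed', "\n";
}
```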
Answer by drawnonward for How would you create a string of all UTF-8 characters?
I quickly translated this from C, but it should give you the idea:
function encodeUTF8( $inValue ) {
    $result = "";
    if ( $inValue < 0x00000080 ) {
        $result .= chr( $inValue );
        $extra = 0;
    } else if ( $inValue < 0x00000800 ) {
        $result .= chr( 0x00C0 | ( ( $inValue >> 6 ) & 0x001F ) );
        $extra = 6;
    } else if ( $inValue < 0x00010000 ) {
        $result .= chr( 0x00E0 | ( ( $inValue >> 12 ) & 0x000F ) );
        $extra = 12;
    } else if ( $inValue < 0x00200000 ) {
        $result .= chr( 0x00F0 | ( ( $inValue >> 18 ) & 0x0007 ) );
        $extra = 18;
    } else if ( $inValue < 0x04000000 ) {
        $result .= chr( 0x00F8 | ( ( $inValue >> 24 ) & 0x0003 ) );
        $extra = 24;
    } else if ( $inValue < 0x80000000 ) {
        $result .= chr( 0x00FC | ( ( $inValue >> 30 ) & 0x0001 ) );
        $extra = 30;
    }
    while ( $extra > 0 ) {
        $result .= chr( 0x0080 | ( ( $inValue >> ( $extra -= 6 ) ) & 0x003F ) );
    }
    return $result;
}
The logic is sound but I am not sure about the PHP, so be sure to check it over. I have never tried to use chr like this.
There are a lot of values that you would not want to encode, like the surrogates 0xD800-0xDFFF, the private use area 0xE000-0xF8FF, and the specials 0xFFF0-0xFFFF, and there are several other gaps for combining characters and reserved characters.
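A filter for the clearly unusable values might look like the sketch below. The function name shouldEncode is hypothetical, and it only drops surrogates and Unicode noncharacters. Whether to also skip private use, combining, or unassigned code points depends on what you are testing.

```php
<?php
// Hypothetical helper: decide whether a code point belongs in the test string.
function shouldEncode($cp) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) return false; // surrogates
    if ($cp >= 0xFDD0 && $cp <= 0xFDEF) return false; // noncharacters
    if (($cp & 0xFFFE) === 0xFFFE) return false;      // U+FFFE and U+FFFF of each plane
    return true;
}

// Keep only the usable code points of the Basic Multilingual Plane.
$kept = array_filter(range(0, 0xFFFF), 'shouldEncode');
echo count($kept), "\n"; // 65536 - 2048 - 32 - 2 = 63454
```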
Answer by bobince for How would you create a string of all UTF-8 characters?
You can leverage iconv (or a few other functions) to convert a code point number to a UTF-8 string:
function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

$codeunits = array();
for ($i = 0; $i<0xD800; $i++)
    $codeunits[] = unichr($i);
for ($i = 0xE000; $i<0xFFFF; $i++)
    $codeunits[] = unichr($i);
$all = implode($codeunits);
(I avoided the surrogate range 0xD800-0xDFFF as they aren't valid to put in UTF-8 themselves; that would be "CESU-8".)
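As an example of one of those "other functions", the same conversion can probably be done with mb_convert_encoding. This is only a sketch: it assumes the mbstring extension is loaded, and unichrMb is a made-up name, not part of the answer above.

```php
<?php
// Hypothetical alternative to the iconv-based helper; assumes ext-mbstring is loaded.
function unichrMb($i) {
    // Pack the code point as a 32-bit big-endian integer, then transcode to UTF-8.
    return mb_convert_encoding(pack('N', $i), 'UTF-8', 'UCS-4BE');
}

// Show the UTF-8 bytes for a one-, two-, and three-byte character.
$parts = array();
foreach (array(0x41, 0x100, 0x20AC) as $cp) {
    $parts[] = bin2hex(unichrMb($cp));
}
echo implode(' ', $parts), "\n"; // 41 c480 e282ac
```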
Answer by Php'Regex for How would you create a string of all UTF-8 characters?
>6,1<<7|191&$n): ($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n): ($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):''))); } echo implode('',array_map('chr_utf8',range(0,65535))); // Output a big string, you can increase the range to 1114111?