Now, it's not clear what this will get translated into after the JIT has run, so I converted it to C (changing the enum slightly to save a bit of space) and ran that through MSC in assembler output mode. That gave me a figure of 89 bytes for the code and 14 bytes for the table, which I think is pretty good.
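// Each non-zero value is the matchData index of the last byte of that
// signature, so DetectType can return the index directly
typedef enum
{
    Unknown = 0, BomBigEndianUcs4 = 3, BomUcs4 = 7, BomUtf8 = 10,
    BomUtf16 = 12, BomBigEndianUtf16 = 14
} Encoding;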
// We use bit 3 as an end-of-signature marker; set means this is the last byte
// Bit 5 happens to have the same value as bit 3 in every byte we match,
// so the real bit 3 can be recovered from bit 5
static unsigned char matchData[] =
{
    0x00,0x00,0xF6,0xFF, //  0 - 00 00 FE FF Bom UCS4 Big endian
    0xF7,0xF6,0x00,0x08, //  4 - FF FE 00 00 Bom UCS4 Little endian
    0xE7,0xB3,0xBF,      //  8 - EF BB BF Bom UTF8
    0xF7,0xFE,           // 11 - FF FE Bom UTF16 Little endian
    0xF6,0xFF            // 13 - FE FF Bom UTF16 Big endian
};
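static Encoding DetectType(unsigned char data[], int length)
{
    int i = 0;       // index into matchData
    int offset = 0;  // index into data for the signature being tried
    while (i < (int)sizeof(matchData))
    {
        // Rebuild the real byte: clear the marker bit, then copy bit 5 into bit 3
        unsigned char compare = ((matchData[i] & 0xf7) | ((matchData[i] & 0x20) >> 2));
        if ((offset >= length) || (data[offset] != compare))
        {
            // Mismatch: skip to the last byte of this signature and start over
            offset = 0;
            while ((matchData[i] & 0x08) == 0) i++;
        }
        else
        {
            // Matched the last byte: its index doubles as the enum value
            if ((matchData[i] & 0x08) == 0x08) return (Encoding)i;
            offset++;
        }
        i++;
    }
    return Unknown;
}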
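The table bytes above can be derived mechanically from the real BOM bytes. The following is a minimal sketch of that packing step, assuming the bit 3 / bit 5 scheme described in the comments; the helper names pack_entry, unpack_entry and pack_signature are hypothetical and do not appear in the original code.

```c
#include <stddef.h>

/* Hypothetical helpers illustrating how the table bytes could be produced. */

/* Store a real byte in the table: clear bit 3, then set it only on the
   last byte of a signature. Only valid when bit 5 of value equals bit 3. */
static unsigned char pack_entry(unsigned char value, int is_last)
{
    return (unsigned char)((value & 0xF7) | (is_last ? 0x08 : 0x00));
}

/* Recover the real byte: clear the marker, then copy bit 5 down into bit 3.
   This is the same expression DetectType uses for compare. */
static unsigned char unpack_entry(unsigned char stored)
{
    return (unsigned char)((stored & 0xF7) | ((stored & 0x20) >> 2));
}

/* Pack a whole signature into out[]; returns the number of bytes written. */
static size_t pack_signature(const unsigned char *sig, size_t n,
                             unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = pack_entry(sig[i], i + 1 == n);
    return n;
}
```

Packing EF BB BF this way gives E7 B3 BF, which is the UTF8 row of the table, and unpacking each stored byte returns the original value.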
Following on from that, I extended the code to detect every case listed in Appendix F of the XML specification.
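// The order here must match the order of the entries in matchData,
// because DetectType steps through both together
public enum Encoding
{
    Unknown=0,BomBigEndianUcs4,BomUcs4,BomUtf8,
    BigEndianUcs4,Ucs4Odd,BigEndianUcs4Odd,
    Ucs4,BomUcs4Odd,BomBigEndianUcs4Odd,
    BigEndianUtf16,Utf16,Ascii,EBCDIC,
    BomUtf16,BomBigEndianUtf16
}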
// We use bit 3 as an end-of-signature marker; set means this is the last byte
// Bit 5 happens to have the same value as bit 3 in most of the bytes we match
// For the full Appendix F table that does not hold for the EBCDIC bytes, so
// from the EBCDIC entry (index 47) onwards, which includes the UTF16 Bom
// entries, we use bit 6 instead
private static byte[] matchData =
{
    0x00,0x00,0xF6,0xFF, //  0 - 00 00 FE FF Bom UCS4 Big endian
    0xF7,0xF6,0x00,0x08, //  4 - FF FE 00 00 Bom UCS4 Little endian
    0xE7,0xB3,0xBF,      //  8 - EF BB BF Bom UTF8
    0x00,0x00,0x00,0x3c, // 11 - 00 00 00 3c UCS4 Big endian
    0x00,0x00,0x34,0x08, // 15 - 00 00 3c 00 UCS4 Odd Little
    0x00,0x34,0x00,0x08, // 19 - 00 3c 00 00 UCS4 Odd Big
    0x34,0x00,0x00,0x08, // 23 - 3c 00 00 00 UCS4 Little endian
    0x00,0x00,0xF7,0xFe, // 27 - 00 00 FF FE Bom UCS4 Odd Little
    0xF6,0xF7,0x00,0x08, // 31 - FE FF 00 00 Bom UCS4 Odd Big
    0x00,0x34,0x00,0x3f, // 35 - 00 3c 00 3f UTF16 Big endian
    0x34,0x00,0x37,0x08, // 39 - 3c 00 3f 00 UTF16 Little endian
    0x34,0x37,0x70,0x6d, // 43 - 3c 3f 78 6d Ascii
    0x44,0x67,0xa7,0x9c, // 47 - 4c 6f a7 94 EBCDIC
    0xF7,0xFE,           // 51 - FF FE Bom UTF16 Little endian
    0xF6,0xFF            // 53 - FE FF Bom UTF16 Big endian
};
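public static Encoding DetectType(byte[] data)
{
    int i = 0;       // index into matchData
    int offset = 0;  // index into data for the signature being tried
    Encoding currentEncoding = Encoding.BomBigEndianUcs4;
    while (i < matchData.Length)
    {
        // Rebuild the real byte: clear the marker bit, then copy bit 3 back
        // from bit 6 (index 47 onwards) or from bit 5 (everything earlier)
        byte compare = (byte)((matchData[i] & 0xf7) | (i >= 47 ? ((matchData[i] & 0x40) >> 3) : ((matchData[i] & 0x20) >> 2)));
        if ((offset >= data.Length) || (data[offset] != compare))
        {
            // Mismatch: skip to the last byte of this signature
            // and move on to the next encoding
            offset = 0;
            while ((matchData[i] & 0x08) == 0) i++;
            currentEncoding++;
        }
        else
        {
            if ((matchData[i] & 0x08) == 0x08) return currentEncoding;
            offset++;
        }
        i++;
    }
    return Encoding.Unknown;
}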
I think that isn't too bad a bit of code. If you can live without the EBCDIC detection, you can drop those bytes from the table along with the ternary expression, and I'd guess the code wouldn't be too big.
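To see why the EBCDIC entry is the one that forces the switch from bit 5 to bit 6, it helps to check which bit positions always agree with bit 3 across a set of signature bytes. The sketch below does that; the function mirror_bits and the two arrays are hypothetical names introduced for illustration, and the byte values are taken from the comments in the table.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original code: returns a mask of the
   bit positions (other than bit 3) whose value equals bit 3 in every
   byte of the given set. */
static unsigned mirror_bits(const unsigned char *bytes, size_t n)
{
    unsigned mask = 0xF7u;  /* every bit except bit 3 starts as a candidate */
    for (size_t i = 0; i < n; i++) {
        unsigned bit3 = (bytes[i] >> 3) & 1u;
        for (unsigned k = 0; k < 8; k++) {
            if (k != 3 && ((bytes[i] >> k) & 1u) != bit3)
                mask &= ~(1u << k);  /* bit k disagrees with bit 3 here */
        }
    }
    return mask;
}

/* Distinct byte values used by the non-EBCDIC signatures in the table */
static const unsigned char other_bytes[] = {
    0x00, 0xFE, 0xFF, 0xEF, 0xBB, 0xBF, 0x3C, 0x3F, 0x78, 0x6D
};

/* The four EBCDIC signature bytes */
static const unsigned char ebcdic_bytes[] = { 0x4C, 0x6F, 0xA7, 0x94 };
```

Calling mirror_bits on other_bytes leaves only bit 5 (0x20) set, while on ebcdic_bytes it leaves only bit 6 (0x40) set, so no single bit covers both groups. FF and FE agree with bit 3 in both bit 5 and bit 6, which is why the UTF16 Bom entries work whether they sit before or after the EBCDIC entry.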
I'll have a think about alternative methods over the weekend and see if I can beat the 103 bytes (89 of code plus 14 of table) that my first attempt generated.