Coming right on the heels of C# garbage collection profiling, I wanted to dive a little deeper into the performance optimizations I made in the image loading routines for EndlessClient.

The need for this performance improvement originated from two things - a desire to reduce the number of data copies when loading images, and a genuine need to reduce the number of garbage collections EndlessClient was triggering.

EndlessClient and PE files

EndlessClient is developed using Monogame. Monogame has a decent content management system, wherein you can load textures by name. Textures are precompiled by the Monogame content pipeline when your application is first built.

var texture = Content.Load<Texture2D>("mytexture");

The content pipeline is used for some textures in EndlessClient - where the original Endless Online client drew rounded rectangles as graphics primitives (for chat bubbles and party health indicators), EndlessClient has no rounded rectangle primitive, so it loads appropriately sized static assets instead.

These textures are in the vast minority though. The original game assets are stored in files with the extension “EGF” (Endless Graphics File), and referred to as “gfx” after the subfolder in which they were distributed. These EGF files are compiled resource files stored in Windows Portable Executable format.

The traditional way of accessing these files is either to use C++ and make the Win32 calls directly, or to P/Invoke them from C#. The calls are LoadLibrary and LoadImage, with the MAKEINTRESOURCE macro as an input to LoadImage.
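
For reference, the Win32 route via P/Invoke looks roughly like this (a sketch with simplified signatures; the helper class, file path, and resource ID are made up for illustration):

using System;
using System.Runtime.InteropServices;

internal static class NativeGfx
{
    [DllImport("kernel32", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr LoadLibrary(string lpFileName);

    [DllImport("user32", SetLastError = true)]
    private static extern IntPtr LoadImage(IntPtr hInst, IntPtr name, uint type, int cx, int cy, uint fuLoad);

    private const uint IMAGE_BITMAP = 0;
    private const uint LR_CREATEDIBSECTION = 0x2000;

    public static IntPtr LoadBitmapResource(string egfPath, int resourceId)
    {
        var module = LoadLibrary(egfPath);

        // MAKEINTRESOURCE is just the integer resource ID reinterpreted as a pointer-sized value
        return LoadImage(module, (IntPtr)resourceId, IMAGE_BITMAP, 0, 0, LR_CREATEDIBSECTION);
    }
}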

For cross-platform support, a custom resource loader is required. At the time of authoring, there were no good cross-platform PE resource loaders available for C#. So, I found a format specification and wrote my own, and what emerged was PELoaderLib.

The original approach

PELoaderLib has gone through a few minor iterations. Prior to .NET Core, the library interface used System.Drawing types, since the Mono runtime provided an implementation for them, making cross-platform image loading very easy. After switching to .NET Core, raw byte[] data was returned instead and piped to the appropriate image loading routines in the client itself. I don't want to focus too much on the public interface; we're instead going to dive into the implementation details of the core class: PEFile.

Initializing the data

PEFile is essentially a wrapper around a memory-mapped file. The appropriate data is parsed when the file is initialized, and the header records are cached for quick lookups later.

private MemoryMappedFile _file;
private MemoryMappedViewStream _fileStream;

_file is our file handle, while _fileStream is a sequential view into the file itself. A consequence of using a sequential-access view stream is that we need a lot of routines for stream manipulation: requests to load data don't come in a neat order, so we'll be jumping around to different offsets.
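
How these get created isn't shown above, but initialization amounts to opening a read-only mapping over the EGF file, roughly like this (a sketch; the _filePath field and parameter choices here are assumptions, not the library's exact code):

// Sketch: open the EGF file as a read-only memory-mapped file and create a sequential view over it
_file = MemoryMappedFile.CreateFromFile(_filePath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
_fileStream = _file.CreateViewStream(0, 0, MemoryMappedFileAccess.Read);
// ...then parse the DOS/NT headers and section headers and build the resource directory caches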

#region Stream Manipulation

private void SetStreamToDOSHeaderOffset()
{
    _fileStream.Seek(DOSHeader.e_lfanew, SeekOrigin.Begin);
}

private void SetStreamToStartOfImageFileHeader()
{
    SetStreamToDOSHeaderOffset();
    _fileStream.Seek(SIZE_OF_NT_SIGNATURE, SeekOrigin.Current);
}

private void SetStreamToStartOfOptionalFileHeader()
{
    SetStreamToStartOfImageFileHeader();
    _fileStream.Seek(ImageFileHeader.IMAGE_FILE_HEADER_SIZE, SeekOrigin.Current);
}

private void SetStreamToStartOfSectionHeaders()
{
    SetStreamToStartOfOptionalFileHeader();
    _fileStream.Seek(OptionalFileHeader.OPTIONAL_FILE_HEADER_SIZE, SeekOrigin.Current);
}

private void SetStreamToStartOfResourceSection(ImageSectionHeader resourceHeader)
{
    _fileStream.Seek((int)resourceHeader.PointerToRawData, SeekOrigin.Begin);
}

#endregion

Let’s look at the data processing next. PE files use “resource directory entries” to describe the offsets into the file for where the data actually lives. These directories are pre-populated in PELoaderLib so that access can be faster when actually getting the data. The initialization logic by itself is pretty simple.

BuildLevelOneCache();

var resourceTypes = (ResourceType[])Enum.GetValues(typeof(ResourceType));
foreach (var resourceType in resourceTypes)
{
    BuildLevelTwoCache(resourceType);

    if (_levelTwoCache.ContainsKey(resourceType))
    {
        foreach (var level2Entry in _levelTwoCache[resourceType].Values)
            BuildLevelThreeCache(resourceType, level2Entry);
    }
}

// BuildLevelOneCache and BuildLevelTwoCache have very similar implementations.
private void BuildLevelThreeCache(ResourceType resourceType, ResourceDirectoryEntry level2Entry)
{
    if (!_levelTwoCache.ContainsKey(resourceType))
        return;

    if (!_levelThreeCache.ContainsKey(resourceType))
        _levelThreeCache[resourceType] = new Dictionary<int, List<(int CultureID, ResourceDirectoryEntry Entry)>>();

    var resourceSectionHeader = _sectionMap[DataDirectoryEntry.Resource];

    var resourceDirectoryFileOffset = resourceSectionHeader.PointerToRawData + ResourceDirectory.RESOURCE_DIRECTORY_SIZE;

    _fileStream.Seek(resourceDirectoryFileOffset + (level2Entry.OffsetToData & 0x7FFFFFFF), SeekOrigin.Begin);

    var l3CacheRef = _levelThreeCache[resourceType];

    ResourceDirectoryEntry level3Entry;
    do
    {
        level3Entry = GetResourceDirectoryEntryAtCurrentFilePosition();
        if (!l3CacheRef.ContainsKey((int)level2Entry.Name))
            l3CacheRef.Add((int)level2Entry.Name, new List<(int, ResourceDirectoryEntry)>());

        l3CacheRef[(int)level2Entry.Name].Add(((int)level3Entry.Name, level3Entry));
    } while (level3Entry.Name != 0);
}

Reading bitmap data

Now that we’ve built a cache of all the resources contained in a PE file, we need a way to get the actual image data out.

Bitmap data is retrieved by looking up the resource ID for Bitmap resource types. The caches that were previously built during initialization contain offsets to the bitmap data, but we need to check that the expected entry is present in each of the caches:

  • Level One represents a mapping of resource types to resource directory entries. Each entry contains an offset to the level two cache data.
  • Level Two represents a mapping of resource types to sets of resource directory entries. Each entry contains an offset to the start of the level three cache data.
  • Level Three represents a mapping of resource types to sets of resource directory entries, with an additional culture ID specifier. Each entry contains an offset to the start of the bitmap resource data.

When we get the data out of the PE file, we check that the level 1 and level 2 caches have the expected resource type key. Otherwise, we will never have initialized the level 3 cache that actually contains what we’re looking for. Next, we examine the level 3 cache entry for the given resource ID and culture. If culture is not specified, we take the first entry with the specified resource ID.
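
For reference, the cache fields themselves might be declared roughly like this (a sketch inferred from how they are used in the surrounding snippets; the actual declarations in PELoaderLib may differ):

// Level 1: resource type -> directory entry (offset to the level two table)
private readonly Dictionary<ResourceType, ResourceDirectoryEntry> _levelOneCache
    = new Dictionary<ResourceType, ResourceDirectoryEntry>();

// Level 2: resource type -> resource ID -> directory entry (offset to the level three table)
private readonly Dictionary<ResourceType, Dictionary<int, ResourceDirectoryEntry>> _levelTwoCache
    = new Dictionary<ResourceType, Dictionary<int, ResourceDirectoryEntry>>();

// Level 3: resource type -> resource ID -> (culture ID, directory entry pointing at the bitmap data)
private readonly Dictionary<ResourceType, Dictionary<int, List<(int CultureID, ResourceDirectoryEntry Entry)>>> _levelThreeCache
    = new Dictionary<ResourceType, Dictionary<int, List<(int CultureID, ResourceDirectoryEntry Entry)>>>();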

Putting that all together, the implementation looks like this:

public byte[] GetEmbeddedBitmapResourceByID(int intResource, int cultureID = -1)
{
    if (!Initialized)
        throw new InvalidOperationException("The PE File must be initialized first");

    var bytes = GetResourceByID(ResourceType.Bitmap, intResource, cultureID);

    if (bytes == null || bytes.Length == 0)
        throw new ArgumentException(string.Format("Error loading the resource: could not find the specified resource for ID {0} and Culture {1}", intResource, cultureID));

    return PrependBitmapFileHeaderToResourceBytes(bytes);
}

public byte[] GetResourceByID(ResourceType resourceType, int intResource, int cultureID = -1)
{
    if (!Initialized)
        throw new InvalidOperationException("The PE File must be initialized first");

    if (!_levelOneCache.ContainsKey(resourceType))
        return new byte[0];

    return FindMatchingLevel2ResourceEntry(resourceType, intResource, cultureID);
}

private byte[] PrependBitmapFileHeaderToResourceBytes(byte[] resourceBytes)
{
    var totalFileSize = (uint)(resourceBytes.Length + BitmapFileHeader.BMP_FILE_HEADER_SIZE);
    var retArray = new byte[totalFileSize];

    var headerSize = BitConverter.ToInt32(resourceBytes, 0);
    var bitmapHeaderBytes = resourceBytes.Take(headerSize).ToArray();

    var bitmapFileHeader = new BitmapFileHeader(totalFileSize, bitmapHeaderBytes);
    bitmapFileHeader.ToByteArray().CopyTo(retArray, 0);

    resourceBytes.CopyTo(retArray, BitmapFileHeader.BMP_FILE_HEADER_SIZE);

    return retArray;
}

private byte[] FindMatchingLevel2ResourceEntry(ResourceType resourceType, int resourceID, int cultureID)
{
    if (!_levelTwoCache.ContainsKey(resourceType))
        return new byte[0];

    return GetResourceDataForCulture(resourceType, resourceID, cultureID);
}

private byte[] GetResourceDataForCulture(ResourceType resourceType, int resourceID, int cultureID)
{
    var resourceSectionHeader = _sectionMap[DataDirectoryEntry.Resource];
    var l3CacheRef = _levelThreeCache[resourceType];

    if (!l3CacheRef.ContainsKey(resourceID) || (cultureID >= 0 && !l3CacheRef[resourceID].Any(x => x.CultureID == cultureID)))
        return new byte[0];

    if (cultureID < 0)
    {
        cultureID = l3CacheRef[resourceID].First().CultureID;
    }

    var resourceDataEntry = GetResourceDataEntryAtOffset(l3CacheRef[resourceID].First(x => x.CultureID == cultureID).Entry.OffsetToData);
    var resourceDataOffset = resourceSectionHeader.PointerToRawData + resourceDataEntry.OffsetToData - resourceSectionHeader.VirtualAddress;

    _fileStream.Seek(resourceDataOffset, SeekOrigin.Begin);
    var bytes = new byte[resourceDataEntry.Size];
    _fileStream.Read(bytes, 0, bytes.Length);

    return bytes;
}

Note that on lookups, the level one and level two caches are only used to validate that the expected resource type key exists.

Problem 1: Sequential Access

One problem with this approach is the need to constantly change the file stream position based on the data we're trying to read. There isn't much overhead here, but it is a bit unnecessary given that .NET also provides a random-access view into memory-mapped files.

Problem 2: Data copies

Looking at PELoaderLib alone, we have a number of data allocations/copies that discard the source array. Here they are collected together for easy viewing:

// COPY 1 :: Getting the data out of the file
var bytes = new byte[resourceDataEntry.Size];
_fileStream.Read(bytes, 0, bytes.Length);

// COPY 2 :: Prepending the bitmap file header to a new byte array
var retArray = new byte[totalFileSize];
// ... more logic
bitmapFileHeader.ToByteArray().CopyTo(retArray, 0);
resourceBytes.CopyTo(retArray, BitmapFileHeader.BMP_FILE_HEADER_SIZE);

Once PELoaderLib is done handling the data, we have even more data copies:

// COPY 3 :: Copying the data to a SixLabors.ImageSharp IImage
var fileBytes = _modules[file].GetEmbeddedBitmapResourceByID(resourceValue + 100);
return Image.Load(fileBytes);

// Getting the image's memory data (no copy)
image.DangerousTryGetSinglePixelMemory(out var mem);

var ret = new Texture2D(_graphicsDeviceProvider.GraphicsDevice, image.Width, image.Height);
// COPY 4 :: creating an array out of the memory (ToArray call)
// COPY 5 :: copying the ToArray result to the actual texture
ret.SetData(mem.ToArray());

This is a total of five copies for the same data for a given bitmap. While actual performance wasn’t noticeably bad, it still left a bad taste in my mouth.

The new approach

Random file access

Starting with PELoaderLib, I first wanted to address the sequential access issue in favor of a random file accessor. For this, I leveraged MemoryMappedViewAccessor instead of MemoryMappedViewStream. This required some rewriting of the code since there was no longer the same stream manipulation capability, but it wasn’t too involved.
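
The accessor field is the _fileAccessor referenced in the snippets below; creating it from the existing memory-mapped file looks roughly like this (a sketch, not the library's exact code):

// Sketch: a random-access view over the whole mapped file, replacing _fileStream
private MemoryMappedViewAccessor _fileAccessor;

// during initialization:
_fileAccessor = _file.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);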

The primary change was tracking the offset when iterating ResourceDirectoryEntries for building the caches.

private void BuildLevelThreeCache(ResourceType resourceType, ResourceDirectoryEntry level2Entry)
{
    if (!_levelTwoCache.ContainsKey(resourceType))
        return;

    if (!_levelThreeCache.ContainsKey(resourceType))
        _levelThreeCache[resourceType] = new Dictionary<int, List<(int CultureID, ResourceDirectoryEntry Entry)>>();

    var resourceSectionHeader = _sectionMap[DataDirectoryEntry.Resource];

    var resourceDirectoryFileOffset = resourceSectionHeader.PointerToRawData + ResourceDirectory.RESOURCE_DIRECTORY_SIZE;

    var offset = resourceDirectoryFileOffset + (level2Entry.OffsetToData & 0x7FFFFFFF);

    var l3CacheRef = _levelThreeCache[resourceType];

    ResourceDirectoryEntry level3Entry;
    do
    {
        level3Entry = GetResourceDirectoryEntryAtOffset(offset);
        if (!l3CacheRef.ContainsKey((int)level2Entry.Name))
            l3CacheRef.Add((int)level2Entry.Name, new List<(int, ResourceDirectoryEntry)>());

        l3CacheRef[(int)level2Entry.Name].Add(((int)level3Entry.Name, level3Entry));
        offset += ResourceDirectoryEntry.ENTRY_SIZE;
    } while (level3Entry.Name != 0);
}

private ResourceDirectoryEntry GetResourceDirectoryEntryAtOffset(uint offset)
{
    if (offset > _fileAccessor.SafeMemoryMappedViewHandle.ByteLength)
        return new ResourceDirectoryEntry(0, 0);

    var directoryEntryArray = new byte[ResourceDirectoryEntry.ENTRY_SIZE];
    _fileAccessor.ReadArray(offset, directoryEntryArray, 0, directoryEntryArray.Length);
    return new ResourceDirectoryEntry(BitConverter.ToUInt32(directoryEntryArray, 0),
                                        BitConverter.ToUInt32(directoryEntryArray, 4));
}

Note that we now update the offset variable and pass it to GetResourceDirectoryEntryAtOffset, which reads the entry's bytes out of the file at the appropriate location.

Reducing data copies

My goal in these changes was to take the raw data directly out of the file without needing to copy it anywhere, which I’m pretty sure is what Windows does in LoadImage. To accomplish this, I wanted to represent the image data as a Span<T> or Memory<T> of bytes.

From MSDN:

ReadOnlySpan is a ref struct that is allocated on the stack and can never escape to the managed heap. Ref struct types have a number of restrictions to ensure that they cannot be promoted to the managed heap, including that they can’t be boxed, captured in lambda expressions, assigned to variables of type Object, assigned to dynamic variables, and they cannot implement any interface type.

A ReadOnlySpan instance is often used to reference the elements of an array or a portion of an array. Unlike an array, however, a ReadOnlySpan instance can point to managed memory, native memory, or memory managed on the stack.

I ended up choosing ReadOnlySpan<T> since it matched my use case: I have a portion of a larger buffer that I want to represent as a block of bytes.
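
As a trivial illustration of that idea (just an example, not code from the library):

// A span is a typed view over existing memory; slicing allocates and copies nothing
byte[] buffer = new byte[256];
ReadOnlySpan<byte> view = buffer.AsSpan(16, 128); // refers to buffer[16..144] in place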

Now, for the most part, C# makes it difficult to shoot yourself in the foot, but the language still allows you to do some pretty nasty things. Coupled with the ReadOnlySpan<T> constructor that takes a pointer and a length, this gave me the idea of returning a span constructed directly over the raw data pointer at the start of the bitmap image data.

Here is the updated GetResourceDataForCulture method:

private ReadOnlySpan<byte> GetResourceDataForCulture(ResourceType resourceType, int resourceID, int cultureID)
{
    var resourceSectionHeader = _sectionMap[DataDirectoryEntry.Resource];
    var l3CacheRef = _levelThreeCache[resourceType];

    if (!l3CacheRef.ContainsKey(resourceID) || (cultureID >= 0 && !l3CacheRef[resourceID].Any(x => x.CultureID == cultureID)))
        return new byte[0];

    if (cultureID < 0)
    {
        cultureID = l3CacheRef[resourceID].First().CultureID;
    }

    var resourceDataEntry = GetResourceDataEntryAtOffset(l3CacheRef[resourceID].First(x => x.CultureID == cultureID).Entry.OffsetToData);
    var resourceDataOffset = resourceSectionHeader.PointerToRawData + resourceDataEntry.OffsetToData - resourceSectionHeader.VirtualAddress;

    unsafe
    {
        byte* filePointer = null;
        _fileAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref filePointer);
        return new ReadOnlySpan<byte>((void*)((ulong)filePointer + resourceDataOffset), (int)resourceDataEntry.Size);
    }
}

Instead of allocating a new array and copying the data to it, we use ReadOnlySpan<byte> to point to the segment of the data in the memory-mapped file. Very cool!

For prepending the bitmap header, I did some additional digging on performant copies in C#, since there is really no way around copying the data in this case. Based on benchmarking, the fastest I found was Unsafe.CopyBlock. Here is the new PrependBitmapFileHeaderToResourceBytes with this change; note that the byte array is implicitly converted to Memory<byte>.

private unsafe Memory<byte> PrependBitmapFileHeaderToResourceBytes(ReadOnlySpan<byte> resourceBytes)
{
    var headerSize = BitConverter.ToInt32(resourceBytes.Slice(0, 4).ToArray(), 0);
    var bitmapHeaderBytes = resourceBytes.Slice(0, headerSize).ToArray();

    var totalFileSize = (uint)(resourceBytes.Length + BitmapFileHeader.BMP_FILE_HEADER_SIZE);
    var bitmapFileHeader = new BitmapFileHeader(totalFileSize, bitmapHeaderBytes).ToByteArray();

    var retArray = new byte[totalFileSize];
    fixed (byte* headerSource = bitmapFileHeader)
    fixed (byte* source = resourceBytes)
    fixed (byte* target = retArray)
    {
        Unsafe.CopyBlock(target, headerSource, BitmapFileHeader.BMP_FILE_HEADER_SIZE);
        Unsafe.CopyBlock(target + BitmapFileHeader.BMP_FILE_HEADER_SIZE, source, (uint)resourceBytes.Length);
    }

    return retArray;
}

This part probably could have kept using Array.Copy without much impact, but I figured if I’m already traveling down the path of unsafe code, I might as well go all out.
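
For comparison, the safe equivalent would look something like this (a sketch; span CopyTo stands in for Array.Copy here, since the source is a ReadOnlySpan<byte>):

// Safe copies: header bytes via Array.CopyTo, resource bytes via span CopyTo
bitmapFileHeader.CopyTo(retArray, 0);
resourceBytes.CopyTo(retArray.AsSpan((int)BitmapFileHeader.BMP_FILE_HEADER_SIZE));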

EndlessClient also had some updates to further reduce the number of data copies taking place once the image was loaded. Monogame 3.8.1 introduced a new Action<byte[]> processAction parameter to Texture2D.FromStream, allowing the caller to process the image data as it is loaded. This enabled me to completely drop SixLabors.ImageSharp as a dependency, which by itself removes a data copy.

One problem remained: I have a Memory<byte> object, but I need to somehow turn it into a stream so it can be loaded by Texture2D.FromStream. I wanted to avoid calling ToArray() on the Memory<byte> and passing that to a MemoryStream, because this would just re-introduce a data copy. Fortunately, Microsoft publishes a CommunityToolkit.HighPerformance package that includes an AsStream extension method, which wraps the Memory<T> in a Stream without copying anything.
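
A minimal usage sketch (the buffer contents here are made up; the using directive reflects where I understand the extension to live):

using System;
using System.IO;
using CommunityToolkit.HighPerformance; // AsStream() extension for Memory<byte>

Memory<byte> imageData = new byte[] { 0x42, 0x4D, 0x00, 0x00 };
using Stream stream = imageData.AsStream(); // wraps the memory directly; no copy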

Here is how EndlessClient now loads GFX:

private Texture2D LoadTexture(GFXTypes file, int resourceVal, bool transparent)
{
    var rawData = _gfxLoader.LoadGFX(file, resourceVal);

    if (rawData.IsEmpty)
        return new Texture2D(_graphicsDeviceProvider.GraphicsDevice, 1, 1);

    Action<byte[]> processAction = null;

    if (transparent)
    {
        // for all gfx: 0x000000 is transparent
        processAction = data => CrossPlatformMakeTransparent(data, Color.Black);

        // for hats: 0x080000 is transparent
        if (file == GFXTypes.FemaleHat || file == GFXTypes.MaleHat)
        {
            processAction = data => CrossPlatformMakeTransparent(data,
                // TODO: 0x000000 is supposed to clip hair below it
                new Color(0xff000000),
                new Color(0xff080000),
                new Color(0xff000800),
                new Color(0xff000008));
        }
    }

    using var ms = rawData.AsStream();
    var ret = Texture2D.FromStream(_graphicsDeviceProvider.GraphicsDevice, ms, processAction);

    return ret;
}

CrossPlatformMakeTransparent in this case iterates pixel by pixel over the image data and sets the alpha to zero for any pixels matching the input color set. The implementation I chose reads four bytes at a time (assuming 32bpp image data).

The first-pass implementation took a general approach of creating a color out of each pixel:

for (int i = 0; i < data.Length; i += 4)
{
    var color = Color.FromNonPremultiplied(data[i], data[i+1], data[i+2], data[i+3]);
    if (transparentColors.Contains(color))
        data[i+3] = 0;
}

Since I had been working with pointers, I wanted to (unnecessarily) further optimize this code. I ended up reading the array of bytes as packed 32-bit integers, converting those to colors, and setting the entire color to TransparentBlack where needed.

fixed (byte* ptr = data)
{
    for (int i = 0; i < data.Length; i += 4)
    {
        uint* addr = (uint*)(ptr + i);
        if (transparentColors.Contains(new Color(*addr)))
            *addr = 0;
    }
}
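
Putting the fragments together, the whole method might look roughly like this (a sketch; the actual signature and color handling in EndlessClient may differ):

// Sketch: pointer-based transparency pass over 32bpp pixel data
private static unsafe void CrossPlatformMakeTransparent(byte[] data, params Color[] transparentColors)
{
    // HashSet lookup keeps the per-pixel check cheap when several candidate colors are passed
    var transparentSet = new HashSet<Color>(transparentColors);

    fixed (byte* ptr = data)
    {
        for (int i = 0; i < data.Length; i += 4)
        {
            uint* addr = (uint*)(ptr + i);
            if (transparentSet.Contains(new Color(*addr)))
                *addr = 0; // Color.TransparentBlack has a packed value of 0
        }
    }
}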

I don't think this approach grants much in the way of per-image performance gains, but across the many images being processed, the small savings add up.