I always tell myself that I’ll try posting more frequently on my blog, and yet here I am, two years later. Perhaps this post will provide the necessary motivation to conduct more public research. I do love it.
This post details a novel remote code injection technique I discovered while playing around with delay loading DLLs. It allows for the injection of arbitrary code into arbitrary remote, running processes, provided that they implement the abused functionality. To make it abundantly clear, this is not an exploit; it’s simply another strategy for migrating into other processes.
Modern code injection techniques typically rely on a variation of two Win32 API calls: CreateRemoteThread and NtQueueApcThread. Endgame recently put out a great article detailing ten methods of process injection. While not all of them allow for injection into remote processes, particularly those already running, the article does cover the most common public variations. This strategy is more akin to inline hooking, though we’re not touching the IAT and we don’t require our code to already be in the process. There are no calls to NtQueueApcThread or CreateRemoteThread, and no need for thread or process suspension. There are some limitations, as with anything, which I’ll detail below.
Delay Load DLL
Delay loading is a linker strategy that allows for the lazy loading of DLLs. Executables commonly load all necessary dynamically linked libraries when the process starts and perform the IAT fix-ups then. Delay loading, however, allows these libraries to be loaded lazily at call time, supported by a pseudo IAT that’s fixed up on first call. This process is illustrated by the following decades-old figure:
This image comes from a great Microsoft article released in 1998 that describes the strategy quite well, but I’ll attempt to distill it here.
Portable executables contain a data directory named
IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT, which you can see using
or using windbg. The structure of this entry is described in delayhlp.cpp,
included with the WinSDK:
The table itself contains RVAs, not pointers. We can find the delay directory offset by parsing the file header:
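As a rough illustration of that header walk, here is a portable C sketch that works on a raw byte buffer rather than on the Windows SDK types. The struct mirrors the eight 32-bit fields of the descriptor as delayimp.h declares them, but `delay_descr`, its lowercase field names, and `find_delay_directory` are names chosen for this sketch, not SDK identifiers. It assumes a little-endian PE32 or PE32+ image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Eight 32-bit fields, in the order delayimp.h declares them for
   ImgDelayDescr. Field names here are illustrative. */
typedef struct {
    uint32_t attrs;           /* bit 0 set: the fields below are RVAs */
    uint32_t rva_dll_name;    /* name of the delay-loaded DLL */
    uint32_t rva_hmod;        /* slot that caches the module handle */
    uint32_t rva_iat;         /* pseudo IAT */
    uint32_t rva_int;         /* import name table */
    uint32_t rva_bound_iat;   /* optional bound IAT */
    uint32_t rva_unload_iat;  /* optional copy of the original IAT */
    uint32_t timestamp;       /* 0 if not bound */
} delay_descr;

#define DIR_DELAY_IMPORT 13u  /* IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT */

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills rva/size if the image has a delay import directory. */
int find_delay_directory(const uint8_t *img, size_t len,
                         uint32_t *rva, uint32_t *size)
{
    assert(img && rva && size);
    if (len < 0x40 || rd16(img) != 0x5A4D) return 0;   /* "MZ" */
    size_t nt = rd32(img + 0x3C);                      /* e_lfanew */
    if (nt > len || len - nt < 26) return 0;
    if (rd32(img + nt) != 0x00004550) return 0;        /* "PE\0\0" */

    size_t opt = nt + 24;        /* signature (4) + file header (20) */
    uint16_t magic = rd16(img + opt);
    size_t dirs;                 /* data directory offset in the optional header */
    if (magic == 0x10B) dirs = 96;         /* PE32 */
    else if (magic == 0x20B) dirs = 112;   /* PE32+ */
    else return 0;

    /* The directory count and entries 0..13 must fit inside the buffer. */
    if (len - opt < dirs + 8 * (DIR_DELAY_IMPORT + 1)) return 0;
    if (rd32(img + opt + dirs - 4) <= DIR_DELAY_IMPORT) return 0;

    const uint8_t *entry = img + opt + dirs + 8 * DIR_DELAY_IMPORT;
    *rva  = rd32(entry);
    *size = rd32(entry + 4);
    return *rva != 0;
}
```

The returned RVA is relative to the image base, so on a mapped image the first descriptor sits at base plus that value.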
The first entry and its delay-linked DLL can be seen in the following:
This means that WINMM is dynamically linked to explorer.exe, but delay loaded, and will not be loaded into the process until the imported function is invoked. Once loaded, a helper function fixes up the pseudo IAT by using GetProcAddress to locate the desired function and patching the table at runtime.
The pseudo IAT referenced is separate from the standard PE IAT; this IAT is specifically for the delay load functions, and is referenced from the delay descriptor. So, for example, in WINMM.dll’s case, the pseudo IAT is at RVA 000b1000. The second delay descriptor entry would have a separate RVA for its pseudo IAT, and so on.
Using WINMM as our delay example, explorer imports one function from it,
In my particular running instance, it has not been invoked, so the pseudo IAT
has not been fixed up yet. We can see this by dumping its pseudo IAT entry:
Each DLL’s pseudo IAT is null-terminated. The above pointer shows us that the existing entry is merely a springboard thunk within the Explorer process. This takes us here:
The tailMerge function is a linker-generated stub that’s compiled in per DLL, not per function. The __delayLoadHelper2 function, documented in delayhlp.cpp, is the magic here: it calls LoadLibrary/GetProcAddress and patches the pseudo IAT. As a demonstration of how this looks, I compiled a binary that
delay links dnslib. Here’s the process of resolution of
Now the pseudo IAT entry has been patched up and the correct function is invoked on subsequent calls. This has the additional side effect of leaving the pseudo IAT as both executable and writable:
At this point, the DLL has been loaded into the process and the pseudo IAT patched up. In another twist, not all functions are resolved on load; only the one that is invoked is. This leaves the pseudo IAT in a mixed state:
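One way to spot this mixed state when looking at a dump is to check where each pseudo IAT slot points. The sketch below assumes that an unresolved slot still points at a thunk inside the importing image, while a resolved slot points somewhere outside it; that is a heuristic for illustration, not something the delay loader guarantees. It also assumes 64-bit slots, and `iat_summary` and `summarize_pseudo_iat` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t total;       /* slots seen before the null terminator */
    size_t resolved;    /* slots pointing outside the importing image */
    size_t unresolved;  /* slots still pointing at a thunk inside the image */
} iat_summary;

/* Walks a null-terminated pseudo IAT (already copied into `iat`) and
   classifies each slot by whether it falls inside the importing image. */
iat_summary summarize_pseudo_iat(const uint64_t *iat, size_t max_entries,
                                 uint64_t image_base, uint64_t image_size)
{
    assert(iat != NULL);
    iat_summary s = {0, 0, 0};
    for (size_t i = 0; i < max_entries && iat[i] != 0; i++) {
        uint64_t p = iat[i];
        int in_image = p >= image_base && p - image_base < image_size;
        if (in_image) s.unresolved++;
        else          s.resolved++;
        s.total++;
    }
    return s;
}
```

Fed the four slots of a table like the one above, this would report two resolved and two unresolved entries.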
In the above, two of the four functions are resolved and the DUI70.dll library is loaded into the process. In each entry of the delay load descriptor, the structure referenced above maintains an RVA to the HMODULE. If the module isn’t loaded, it will be null. So when a delayed function is invoked whose module is already loaded, the delay helper function will check its entry to determine if a handle to it can be used:
The idd structure is just an instance of the InternalImgDelayDescr described above, passed into the __delayLoadHelper2 function from the linker’s tailMerge stub. If the module is already loaded, as recorded in the delay entry, the helper uses that handle instead. It does NOT call LoadLibrary regardless of this value; it only does so when the handle is null, which can be used to our advantage.
Another note here is that the delay loader supports notification hooks. There are six states we can hook into: processing start, pre load library, fail load library, pre GetProcAddress, fail GetProcAddress, and end processing. You can see how the hooks are used in the above code sample.
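To make the flow concrete, here is a heavily simplified model of that helper logic in portable C. It is not the code from delayhlp.cpp: the loader and resolver are passed in as function pointers standing in for LoadLibrary and GetProcAddress, failures return NULL where the real helper raises an exception, and hook return values are ignored. The notification names follow the enum in delayimp.h; everything else (`model_descr`, `model_delay_helper`, the `demo_*` stubs) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Notification codes, in the order delayimp.h declares them. */
enum {
    dliStartProcessing,
    dliNotePreLoadLibrary,
    dliNotePreGetProcAddress,
    dliFailLoadLib,
    dliFailGetProc,
    dliNoteEndProcessing
};

typedef void *(*load_fn)(const char *dll_name);              /* LoadLibrary stand-in */
typedef void *(*resolve_fn)(void *module, const char *sym);  /* GetProcAddress stand-in */
typedef void  (*notify_fn)(unsigned code, const char *dll_name);

typedef struct {
    const char *dll_name;  /* DLL name taken from the descriptor */
    void       *hmod;      /* cached module handle, NULL until first load */
} model_descr;

/* Resolves `sym`, patches *iat_slot on success, and returns the address. */
void *model_delay_helper(model_descr *d, const char *sym, void **iat_slot,
                         load_fn load, resolve_fn resolve, notify_fn notify)
{
    assert(d && sym && iat_slot && load && resolve);
    void *proc = NULL;
    if (notify) notify(dliStartProcessing, d->dll_name);

    if (d->hmod == NULL) {               /* only load when no handle is cached */
        if (notify) notify(dliNotePreLoadLibrary, d->dll_name);
        d->hmod = load(d->dll_name);
        if (d->hmod == NULL && notify) notify(dliFailLoadLib, d->dll_name);
    }
    if (d->hmod != NULL) {
        if (notify) notify(dliNotePreGetProcAddress, d->dll_name);
        proc = resolve(d->hmod, sym);
        if (proc != NULL) *iat_slot = proc;   /* patch the pseudo IAT slot */
        else if (notify) notify(dliFailGetProc, d->dll_name);
    }
    if (notify) notify(dliNoteEndProcessing, d->dll_name);
    return proc;
}

/* Demo stubs: count loads and hand back dummy addresses. */
int demo_load_calls = 0;
static int demo_module, demo_symbol;
void *demo_load(const char *n) { (void)n; demo_load_calls++; return &demo_module; }
void *demo_resolve(void *m, const char *s) { (void)m; (void)s; return &demo_symbol; }
```

Calling the model twice with the same descriptor invokes the loader only once, because the second call finds the cached handle.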
Finally, in addition to delay loading, the portable executable also supports delay library unloading. It works pretty much how you’d expect it, so we won’t be touching on it here.
Before detailing how we might abuse this (though it should be fairly obvious), it’s important to note the limitations of this technique. It is not completely portable, and using pure delay load functionality it cannot be made to be so.
The glaring limitation is that the technique requires the remote process to be delay linked. A brief crawl of some local processes on my host shows many Microsoft applications are: dwm, explorer, cmd. Many non-Microsoft applications are as well, including Chrome. It is additionally a well-supported feature of the portable executable format, and exists today on modern systems.
Another limitation is that, because at its core it relies on LoadLibrary, there must exist a DLL on disk. There is no way to LoadLibrary from memory (unless you use one of the countless techniques to do that, but none of them use LoadLibrary…).
In addition to implementing the delay load, the remote process must implement functionality that can be triggered. Instead of doing a CreateRemoteThread, SendNotifyMessage, or ResumeThread, we rely on the fetch to the pseudo IAT, and thus we must be able to trigger the remote process into performing this action/executing this function. This is generally pretty easy if you’re using the suspended process/new process strategy, but may not be trivial on running applications.
Finally, any process that does not allow unsigned libraries to be loaded will block this technique. This is controlled by ProcessSignaturePolicy and can be set with SetProcessMitigationPolicy; it is unclear how many apps use this at the moment, but Microsoft Edge was one of the first big products to employ the policy. This technique is also impacted by ProcessImageLoadPolicy, which can be set to restrict loading of images from a UNC share.
When discussing an ability to inject code into a process, there are three separate cases an attacker may consider, and some additional edge situations within remote processes. Local process injection is simply the execution of shellcode/arbitrary code within the current process. Suspended process is the act of spawning a new, suspended process from an existing, controlled one and injecting code into it. This is a fairly common strategy to employ for migrating code, setting up backup connections, or establishing a known process state prior to injection. The final case is the running remote process.
The running remote process is an interesting case with several caveats that we’ll explore below. I won’t detail suspended processes, as it’s essentially the same as a running process, but easier. It’s easier because many applications actually just load the delay library at runtime, either because the functionality is environmentally keyed and required then, or because another loaded DLL is linked against it and requires it. Refer to the source code for the project for an implementation of suspended process injection.
The local process is the simplest case and arguably the most useless for this strategy. If we can inject and execute code in this manner, we might as well link against the library we want to use. It serves as a fine introduction to the topic, though.
The first thing we need to do is delay link the executable against something.
For various reasons I originally chose
dnsapi.dll. You can specify delay
load DLLs via the linker options for Visual Studio.
With that, we need to obtain the RVA for the delay directory. This can be accomplished with the following function:
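Picking the right table entry out of that directory can be sketched as follows. This is a portable illustration over a byte buffer holding an image in its mapped layout (so an RVA is simply an offset from the start of the buffer), not the project’s Windows code. It assumes each descriptor is eight 32-bit fields with the DLL name RVA as the second field, and that a zeroed descriptor ends the table; `find_delay_descriptor` and `name_matches` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define DESCR_SIZE 32u  /* eight 32-bit fields per delay-load descriptor */

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Case-insensitive compare of the NUL-terminated name at `rva` with `want`. */
static int name_matches(const uint8_t *img, size_t len, uint32_t rva,
                        const char *want)
{
    for (size_t i = 0;; i++) {
        if ((size_t)rva + i >= len) return 0;   /* ran off the end of the image */
        unsigned char a = img[rva + i];
        unsigned char b = (unsigned char)want[i];
        if (tolower(a) != tolower(b)) return 0;
        if (a == 0) return 1;                   /* both strings ended together */
    }
}

/* Returns the offset of the descriptor whose DLL name matches `dll`,
   or 0 if the table has no such entry. */
size_t find_delay_descriptor(const uint8_t *img, size_t len,
                             uint32_t dir_rva, const char *dll)
{
    assert(img && dll);
    for (size_t off = dir_rva;
         off <= len && len - off >= DESCR_SIZE;
         off += DESCR_SIZE) {
        uint32_t name_rva = rd32(img + off + 4);  /* second field: DLL name RVA */
        if (name_rva == 0) break;                 /* zeroed descriptor ends the table */
        if (name_matches(img, len, name_rva, dll)) return off;
    }
    return 0;
}
```

The comparison is case-insensitive because DLL names in import tables do not follow a consistent casing.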
It should be pretty clear what we’re doing here. Once we’ve got the correct table entry, we need to mark the entry’s DllName as writable, overwrite it with our custom DLL name, and restore the protection mask:
Now all that’s left to do is trigger the targeted function. Once triggered, the delay helper function will snag the DllName from the table entry and load the DLL via LoadLibrary.
The most interesting of cases is the running remote process. For demonstration here, we’ll be targeting explorer.exe, as we can almost always rely on it to be running on a workstation under the current user.
With an open handle to the explorer process, we must perform the same searching tasks as we did for the local process, but this time in a remote process. This is a little more cumbersome, but the code can be found in the project repository for reference. We simply grab the remote PEB, parse the image and its directories, and locate the appropriate delay entry we’re targeting.
This part is likely to prove the most unfriendly when attempting to port this
to another process; what functionality are we targeting? What function or
delay load entry is generally unused, but triggerable from the current session?
With explorer there are several options; it’s delay linked against 9 different
DLLs, each averaging 2-3 imported functions. Thankfully one of the first
functions I looked at was pretty straightforward:
function, exported by
CFGMGR32.dll, requests that the system be ejected from
the local docking station. We can therefore assume that it’s likely to be
available and not fixed on workstations, and potentially unfixed on laptops,
should the user never explicitly request the system to be ejected.
When we request for the workstation to be ejected from the docking station, the function sends a PNP request. We use the IShellDispatch object to execute this, which is accessed via Shell, handled by, you guessed it, explorer.
The code for this is pretty simple:
Our DLL only needs to export
CM_Request_Eject_PC for us to not crash the
process; we can either pass on the request to the real DLL, or simply ignore
it. This leads us to stable and reliable remote code injection.
Remote Process – All Fixed
One interesting edge case is a remote process that you want to inject into via delay loading, but all imported functions have been resolved in the pseudo IAT. This is a little more complicated, but all hope is not lost.
Remember when I mentioned earlier that a handle to the delay load library is maintained in its descriptor? This is the value that the helper function checks for to determine if it should reload the module or not; if it’s null, it attempts to load it, if it’s not, it uses that handle. We can abuse this check by nulling out the module handle, thereby “tricking” the helper function into once again loading that descriptor’s DLL.
In the discussed case, however, the pseudo IAT is all patched up; there are no more trampolines into the delay load helper function. Helpfully, the pseudo IAT is writable by default, so we can simply patch in the trampoline function ourselves and have it instantiate the descriptor all over again. In short, this worst-case strategy requires three separate WriteProcessMemory calls: one to null out the module handle, one to overwrite the pseudo IAT entry, and one to overwrite the loaded DLL name.
I should mention that I tested this strategy across several next-gen AV/HIPS appliances, which will go unnamed here, and none were able to detect the cross-process injection strategy. Detection would seem to be an interesting challenge overall; in remote processes, the strategy uses the following chain of calls:
That’s it. The trigger functionality would be dynamic among each process, and the loaded library would be loaded via supported and well-known Windows facilities. I checked out a few other core Windows applications, and they all have pretty straightforward trigger strategies.
The referenced project includes both x86 and x64 support, and has been
tested across Windows 7, 8.1, and 10. It includes three functions of interest:
inject_local, inject_suspended, and inject_explorer. It expects to find
the DLL at
C:\Windows\Temp\TestDLL.dll, but this can obviously be changed.
Note that it isn’t production quality; beware, here be dragons.
Special thanks to Stephen Breen for reviewing this post.