Native dependency resolution problem for #r nuget #10136

Closed
dsyme opened this issue Sep 16, 2020 · 31 comments
Labels: Area-FSI, Bug, Impact-Medium (moderate impact on existing code)

@dsyme
Contributor

dsyme commented Sep 16, 2020

Native DLLs are not being found in libtorch-cpu package which is referenced transitively from TorchSharp and DiffSharp

Analysis

Possible causes: either

  1. There is no managed DLL in the libtorch-cpu package (and hence the native resolution logic decides it doesn't need to probe around in that package)

  2. There is a problem with transitive native dependencies. In this repro, there are two native DLLs ("libLibTorchSharp.so" from TorchSharp and "libtorch.so" from libtorch-cpu). The first is being found, but the load is failing because of its transitive reference to the second.

The relevant resolved transitive package versions for the repro are:

DiffSharp-cpu,1.0.0-preview-258177528
TorchSharp,0.3.52276
libtorch-cpu,1.5.6

Repro steps

#r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;

open DiffSharp
dsharp.config(backend=Backend.Torch)
let t = dsharp.tensor [ 0 .. 10 ];;

A simpler repro might be this (though I'm not certain NativeLibrary.Load triggers resolution using the handlers)

#r "nuget: libtorch-cpu,1.5.6";;
System.Runtime.InteropServices.NativeLibrary.Load("torch_cpu")

Expected behavior

This works

Actual behavior

System.DllNotFoundException: Unable to load DLL 'C:/Users/Administrator/.nuget/packages/torchsharp/0.3.52276/runtimes\win-x64\native\LibTorchSharp.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at System.Runtime.Loader.AssemblyLoadContext.InternalLoadUnmanagedDllFromPath(String unmanagedDllPath)
   at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
   at Microsoft.DotNet.DependencyManager.NativeAssemblyLoadContext.LoadNativeLibrary(String path) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 46
   at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 114
   at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@120-2.Invoke(Assembly delegateArg0, String delegateArg1) in C:\GitHub\dsyme\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 120
   at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
   at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
   at TorchSharp.Tensor.FloatTensor.THSTensor_newFloatScalar(Single scalar, Boolean requiresGrad)
   at TorchSharp.Tensor.FloatTensor.From(Single scalar, Boolean requiresGrad)
   at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@900-1.Invoke(Single v)
   at DiffSharp.Backends.Torch.TorchStatics`2.CreateFromFlatArray(Array values, Int32[] shape, Device device)
   at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at <StartupCode$FSI_0003>.$FSI_0003.main@()

Known workarounds

This is a workaround to force the load of the native DLL that is not being found:

System.Runtime.InteropServices.NativeLibrary.Load(@"C:\Users\Administrator\.nuget\packages\libtorch-cpu\1.5.6\runtimes\win-x64\native\torch_cpu.dll");;

#r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;

open DiffSharp
dsharp.config(backend=Backend.Torch)
let t = dsharp.tensor [ 0 .. 10 ];;

On Linux:

System.Runtime.InteropServices.NativeLibrary.Load(@"/home/jovyan/.nuget/packages/libtorch-cpu/1.5.6/runtimes/linux-x64/native/libtorch.so")

Put together these are:

let path1 = System.IO.Path.GetDirectoryName(typeof<DiffSharp.dsharp>.Assembly.Location)
let path2 =
    if System.Runtime.InteropServices.RuntimeInformation.IsOSPlatform(System.Runtime.InteropServices.OSPlatform.Linux) then
       path1 + "/../../../../libtorch-cpu/1.5.6/runtimes/linux-x64/native/libtorch.so"
    else
       path1 + "/../../../../libtorch-cpu/1.5.6/runtimes/win-x64/native/torch_cpu.dll"
System.Runtime.InteropServices.NativeLibrary.Load(path2)
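
The relative path above depends on where the DiffSharp assembly happens to sit. A minimal sketch of the same preload workaround that instead starts from the NuGet global packages folder follows. It assumes the folder is either named by the `NUGET_PACKAGES` environment variable or is the default `.nuget/packages` under the user profile, and it reuses the package version and file names from the workaround above. The `tryPreload` helper is a made-up name for illustration.

```fsharp
open System
open System.IO
open System.Runtime.InteropServices

// NuGet global packages folder: NUGET_PACKAGES if set, else <user profile>/.nuget/packages.
let nugetRoot =
    let overridden = Environment.GetEnvironmentVariable("NUGET_PACKAGES")
    if String.IsNullOrEmpty(overridden) then
        Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.UserProfile), ".nuget", "packages")
    else
        overridden

// Runtime identifier and file name, as in the workaround above (x64 Linux or Windows only).
let rid, fileName =
    if RuntimeInformation.IsOSPlatform(OSPlatform.Linux) then "linux-x64", "libtorch.so"
    else "win-x64", "torch_cpu.dll"

let nativePath =
    Path.Combine(nugetRoot, "libtorch-cpu", "1.5.6", "runtimes", rid, "native", fileName)

// Hypothetical helper: load the transitive native dependency up front,
// returning false instead of throwing when the load fails.
let tryPreload (path: string) : bool =
    match NativeLibrary.TryLoad(path) with
    | true, _ -> true
    | false, _ -> false
```

Calling `tryPreload nativePath` before the first DiffSharp call would play the same role as the explicit `NativeLibrary.Load` above; the package must already have been restored for the file to exist.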

Related information

It's possible there is something wrong with the packages but this works when referenced from a project.

  • Operating system: both linux and windows
  • .NET Runtime kind (.NET Core, .NET Framework): .NET Core
cartermp added this to the Backlog milestone Sep 16, 2020
@cartermp
Contributor

Repros on RC1 for me

dsyme pushed a commit to dsyme/DiffSharp that referenced this issue Sep 16, 2020
dsyme added a commit to DiffSharp/DiffSharp that referenced this issue Sep 16, 2020
update torch getting started to workaround dotnet/fsharp#10136
@KevinRansom
Member

@dsyme, this is really cool. I will see what I can do.

@KevinRansom
Member

This repros quite nicely on Windows:
I get ...

> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\4456--e5f48bb6-ed4f-4c75-ba1c-dba1ef125698\Project.fsproj.fsx]
namespace FSI_0002.Project

>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
System.DllNotFoundException: Unable to load DLL 'C:/Users/codec/.nuget/packages/torchsharp/0.3.52276/runtimes\win-x64\native\LibTorchSharp.dll' or one of its dependencies: The specified module could not be found. (0x8007007E)
   at System.Runtime.Loader.AssemblyLoadContext.InternalLoadUnmanagedDllFromPath(String unmanagedDllPath)
   at System.Runtime.Loader.AssemblyLoadContext.LoadUnmanagedDllFromPath(String unmanagedDllPath)
   at Microsoft.DotNet.DependencyManager.NativeAssemblyLoadContext.LoadNativeLibrary(String path) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 46
   at Microsoft.DotNet.DependencyManager.NativeDllResolveHandlerCoreClr._resolveUnmanagedDll(Assembly _arg1, String name) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 114
   at <StartupCode$Microsoft-DotNet-DependencyManager>.$NativeDllResolveHandler.-ctor@120-2.Invoke(Assembly delegateArg0, String delegateArg1) in C:\kevinransom\fsharp\src\fsharp\Microsoft.DotNet.DependencyManager\NativeDllResolveHandler.fs:line 120
   at System.Runtime.Loader.AssemblyLoadContext.GetResolvedUnmanagedDll(Assembly assembly, String unmanagedDllName)
   at System.Runtime.Loader.AssemblyLoadContext.ResolveUnmanagedDllUsingEvent(String unmanagedDllName, Assembly assembly, IntPtr gchManagedAssemblyLoadContext)
   at TorchSharp.Tensor.FloatTensor.THSTensor_newFloatScalar(Single scalar, Boolean requiresGrad)
   at TorchSharp.Tensor.FloatTensor.From(Single scalar, Boolean requiresGrad)
   at <StartupCode$DiffSharp-Backends-Torch>.$Torch.RawTensor.-ctor@900-1.Invoke(Single v)
   at DiffSharp.Backends.Torch.TorchStatics`2.CreateFromFlatArray(Array values, Int32[] shape, Device device)
   at DiffSharp.Tensor.create(Object value, FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at DiffSharp.dsharp.config(FSharpOption`1 dtype, FSharpOption`1 device, FSharpOption`1 backend)
   at <StartupCode$FSI_0003>.$FSI_0003.main@()
Stopped due to error
>
-

@KevinRansom
Member

So ... transitive native dependencies are not resolvable using the native resolution event handler. I am fairly confident that it only works for managed code that has native dependencies. It looks to me like the mechanism was designed primarily to make P/Invoke work.

@KevinRansom
Member

@jkotas

With the AssemblyLoadContext.ResolvingUnmanagedDll event handler, we are not being notified of native DLL load attempts that are caused by a native dependency of a native library, on either Windows or Linux. Is that "ByDesign", or is there a mechanism we can use that will allow us to detect attempts to transitively load native DLLs?

In our example above we have a managed library that loads a native library, 'LibTorchSharp.dll', which itself has a native dependency on another library: torch_cpu.dll.

We get notified to locate the LibTorchSharp dependency but not the torch_cpu.dll one.

Not that it will be much help but our handler is here: https://github.com/dotnet/fsharp/blob/main/src/fsharp/Microsoft.DotNet.DependencyManager/NativeDllResolveHandler.fs#L89

@jkotas
Member

jkotas commented Sep 18, 2020

It is by design.

The native library loader is part of OS. It does not expose events to resolve dependencies like this.

Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.

@KevinRansom
Member

@jkotas , thanks mate, that was what I expected.

@KevinRansom
Member

KevinRansom commented Sep 18, 2020

@dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:

It produced this.

c:\kevinransom\fsharp>dotnet artifacts\bin\fsi\Debug\netcoreapp3.1\fsi.exe --langversion:preview

Microsoft (R) F# Interactive version 11.0.0.0 for F# 5.0
Copyright (c) Microsoft Corporation. All Rights Reserved.

For help type #help;;

> #r "nuget: DiffSharp-cpu,1.0.0-preview-258177528";;
[Loading C:\Users\codec\AppData\Local\Temp\nuget\12316--989cf7ca-6ba9-4aab-a922-2bf875d5a299\Project.fsproj.fsx]
namespace FSI_0002.Project

>
- open DiffSharp
- dsharp.config(backend=Backend.Torch)
- let t = dsharp.tensor [ 0 .. 10 ];;
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.core\1.0.0-preview-258177528\lib\netstandard2.1\DiffSharp.Core.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\diffsharp.backends.torch\1.0.0-preview-258177528\lib\netcoreapp3.0\DiffSharp.Backends.Torch.dll'...
Binding session to 'C:\Users\codec\.nuget\packages\torchsharp\0.3.52276\lib\netcoreapp3.0\TorchSharp.dll'...
val t : DiffSharp.Tensor =
  Tensor
    [0.000000, 1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 6.000000, 7.000000, 8.000000, 9.000000, 10.000000]

>

I won't submit a PR; I'm not sure how your build works.

@cartermp
Contributor

Closing as external

@dsyme
Contributor Author

dsyme commented Sep 18, 2020

OK, thank you, I'll find some kind of resolution.

@dsyme
Contributor Author

dsyme commented Sep 18, 2020

@cartermp The packages work OK with applications. From what I see there is nothing wrong with the packages as such - the problem is with our dynamic loader, which doesn't handle transitive native references. (Applications handle this by copying all native DLLs to the one directory on build)

I'm not saying it's easy to fix, but it feels like the problem is with us, and could hit us with any packages that rely on transitive native dependencies, so I'll reopen the bug if that's ok.

That said I will try to find a workaround to arrange the TorchSharp native packages so they are non-transitive.

> Different OSes provide assorted OS-specific mechanisms to help with these scenarios. For example, there is SetDllDirectoryW on Windows or RPATH on Unix.

@KevinRansom Given that in F#/.NET Interactive we are loading DLLs directly from the package directories, it does kind of feel like we should be using these mechanisms to augment the native loader's load paths. It's hard to see any other systematic way to solve this.

@jkotas Did you mean AddDllDirectory?

> @dsyme - if you are okay bundling the libtorch native libs with torchsharp then it will work fine:

@KevinRansom Unfortunately this is not a practical solution.

  1. There are multiple different runtime native DLLs that work with the same managed DLL - basically CPU and GPU - the end application selects one

  2. The collected native DLLs are too large to fit in one nuget package - they are about 1.5GB for GPU for example. So they must be delivered in multiple packages, because in practice both nuget.org and Azure CI and other things place limits on nuget package size around 200MB.

Tricky problem

dsyme reopened this Sep 18, 2020
dsyme added the Area-FSI, Bug and Impact-Medium labels and removed the Area-External and Resolution-External labels Sep 18, 2020
@dsyme
Contributor Author

dsyme commented Sep 18, 2020

I've documented this from the TorchSharp perspective here: dotnet/TorchSharp#169

(I can see that we're not going to make this a high-priority thing for .NET Interactive and F# Interactive unless we hit other packages that have transitive native references.)

@cartermp
Contributor

Yes, this feels like a very niche thing that is low severity

@jkotas
Member

jkotas commented Sep 18, 2020

> @jkotas Did you mean AddDllDirectory?

You are right. AddDllDirectory would be more appropriate for this.
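
For reference, a rough sketch of what registering a package's native directory with the Windows loader could look like from F#. `AddDllDirectory` is a real kernel32 function, but the `addNativeSearchDir` helper is a made-up name and nothing here reflects what FSI actually does. Directories added this way are only consulted for loads that use the `LOAD_LIBRARY_SEARCH_USER_DIRS` or `LOAD_LIBRARY_SEARCH_DEFAULT_DIRS` flags (or after `SetDefaultDllDirectories` has been called), so whether it would help with the runtime's own load of LibTorchSharp.dll is an open assumption.

```fsharp
open System
open System.Runtime.InteropServices

module Kernel32 =
    // Adds an absolute directory to the process DLL search path.
    // Returns an opaque cookie, or IntPtr.Zero on failure.
    [<DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)>]
    extern IntPtr AddDllDirectory(string newDirectory)

// Hypothetical helper: registers `dir` with the OS loader on Windows.
// Returns false on other platforms, where kernel32 does not exist.
let addNativeSearchDir (dir: string) : bool =
    if RuntimeInformation.IsOSPlatform(OSPlatform.Windows) then
        Kernel32.AddDllDirectory(dir) <> IntPtr.Zero
    else
        false
```

The P/Invoke is only bound when it is first called, so the definition itself is harmless on Linux.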

@KevinRansom
Member

@dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.

So ... my linux is not great, however, it seems that rpath is a string embedded into the library that has a dependency. This would require TorchSharp to embed this string for Linux, and to the best of my knowledge the Windows dll loader has no equivalent, so we would still need a windows solution.

The linux equivalent of AddDllDirectory is probably LD_LIBRARY_PATH. Which I can set after package resolution, but before dll load. Because it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat dll hellish. Although I suppose I could swap it in before we do the load, and back out afterwards. Given that it is a dll load operation, that is bound to be vastly more expensive than swapping out an environment variable.

@jkotas, @dsyme could I ask you both to check my PR, if and when I implement it, and see if it is not too terrible? Thanks

Kevin

@dsyme
Contributor Author

dsyme commented Sep 18, 2020

> @dsyme, it shouldn't be too hard to make a change to also use this mechanism. I will put something together, hopefully over the weekend. You can let me know if it works for you.

It's ok, don't worry. I've come to the conclusion that all these native DLLs need to be in the same directory anyway. They register the "torch implementation directory" in a common registry in some way, and it looks like they all have to be in the same place otherwise we get whacky errors like "Key already registered with the same priority: GroupSpatialSoftmax"

I'll think about what to do. Awkward but hey

@KevinRansom
Member

Okay mate, may I close this issue?

@jkotas
Member

jkotas commented Sep 18, 2020

> it is an environment variable, if developers use fsi to spawn new processes they are also going to see this variable, which is somewhat dll hellish.

Also, setting process environment variables is not thread-safe on Unix, which comes with its own set of problems...

@baronfel
Member

For really off-the-wall shenanigans, from what I'm reading you could use patchelf to change the rpath for a binary before loading it as well. You'd probably want to do some kind of shadow-copy of the binary so that you could munge it without clobbering, though.

@KevinRansom
Member

@baronfel, lol. That would be super cool but we would prefer not to copy files in #r, if we copied files, we would shove them in the same directory, and wouldn't have an issue. I am sort of thinking of adding an option that will do that for these really tricky scenarios. However, Don doesn't need it anymore so I'm not going to rush to do something, even real cool nerdy stuff :-)

@dsyme
Contributor Author

dsyme commented Sep 18, 2020

> Okay mate, may I close this issue?

This is still, I think, in some sense a bug in the F# and .NET Interactive loading experience of packages. I think we can only close the issue if we document the limitation.

Are these docs up-to-date? They look a bit dated at first glance? https://github.com/fsharp/fslang-design/blob/master/tooling/FST-1027-fsi-references.md

@KevinRansom
Member

They don't discuss native dependency resolution at all. If we think it's a bug, I can prepare a fix, I would rather fix it now, than in two years time, when I've forgotten all of this stuff.

@dsyme
Contributor Author

dsyme commented Sep 18, 2020

Could you please update the specs to include information on #r for packages with native dependencies? The specs should say what is meant to work and what isn't. Right now it's a little hard to tell what the intended spec is.

Here's an approximate spec, maybe you can work from this?


Spec: Dynamic loading of packages containing native DLLs

Dynamic loading of packages containing native DLLs is supported by adding an event handler to AssemblyLoadContext.Default.ResolvingUnmanagedDll, which is triggered when resolving an unmanaged assembly in the context of a .NET assembly (e.g. a DllImport).

This handler consults current architecture and platform settings plus resolved package metadata and files across all dynamically referenced packages to look for a matching native DLL and then dynamically loads that DLL using an internal NativeAssemblyLoadContext that implements LoadNativeLibrary via LoadUnmanagedDllFromPath.

This process is not triggered for transitive native-to-native references, which are resolved with respect to the native DLL using standard rules of the operating system. Normally this means any transitive native dependencies must sit next to the native DLL at time of load.
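
As an illustration of the mechanism the spec describes, here is a minimal sketch of a `ResolvingUnmanagedDll` handler. It is not the actual `NativeDllResolveHandler` implementation: `probeDirs`, `candidates` and `resolveNative` are made-up names, the probe list is left empty, and the file-name candidates are a simplifying assumption. Note that it can only ever see names requested from managed code, which is exactly the limitation in the last paragraph of the spec.

```fsharp
open System
open System.IO
open System.Reflection
open System.Runtime.InteropServices
open System.Runtime.Loader

// Hypothetical list of native directories gathered from the resolved packages.
let probeDirs : string list = []

// Candidate file names for a DllImport name such as "LibTorchSharp".
let candidates (name: string) : string list =
    [ name; name + ".dll"; "lib" + name + ".so"; "lib" + name + ".dylib" ]

// Returns a handle for the first candidate that exists and loads,
// or IntPtr.Zero when nothing matched.
let resolveNative (_requesting: Assembly) (name: string) : IntPtr =
    probeDirs
    |> Seq.collect (fun dir -> candidates name |> Seq.map (fun file -> Path.Combine(dir, file)))
    |> Seq.filter (fun path -> File.Exists(path))
    |> Seq.tryPick (fun path ->
        match NativeLibrary.TryLoad(path) with
        | true, handle -> Some handle
        | false, _ -> None)
    |> Option.defaultValue IntPtr.Zero

// The event's delegate type is Func<Assembly, string, IntPtr>, so F# exposes it
// through the add_/remove_ methods rather than as an IEvent.
AssemblyLoadContext.Default.add_ResolvingUnmanagedDll(Func<Assembly, string, IntPtr>(resolveNative))
```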


That spec seems pretty clear to me, and I'm pretty sure we shouldn't rush to use any native library loading functionality that .NET doesn't provide (even if that means we can't reasonably support transitive loading of native components with native-to-native references across multiple packages). It's just a total can of worms.

I'm just going to have to work out an approach that works for these horrific Torch native components. Most of the complexity is to do with the vast size of the native components involved. So much AI.

@KevinRansom
Member

KevinRansom commented Sep 18, 2020

@dsyme, I can add a switch that copies files to a single directory on resolution, sort of like publish lite. That would take care of your issues, and mean you don't have to deal with the size issue. It wouldn't be the default; it would be opt-in, so scripts normally wouldn't have to deal with the issues.

It would also mean that we have an approach for transitive package dependencies, for when project build works and scripting fails. What do you think?

Kevin
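
A minimal sketch of what such a "copy to one directory" step might look like, under the assumption that the native directories of the resolved packages are already known. The `collectNativeFiles` helper and the temp-directory naming are made up for illustration and are not a proposed implementation of the switch.

```fsharp
open System
open System.IO

// Hypothetical helper: copy every file from the given native directories into
// one fresh directory, so that native-to-native dependencies sit side by side.
let collectNativeFiles (nativeDirs: string list) : string =
    let target = Path.Combine(Path.GetTempPath(), "fsi-native-" + Guid.NewGuid().ToString("N"))
    Directory.CreateDirectory(target) |> ignore
    for dir in nativeDirs do
        // Directories that do not exist are skipped silently.
        if Directory.Exists(dir) then
            for file in Directory.GetFiles(dir) do
                // Later directories overwrite same-named files from earlier ones.
                File.Copy(file, Path.Combine(target, Path.GetFileName(file)), true)
    target
```

The returned directory could then be handed to whatever probing logic loads the first native DLL; the cost is one full copy of every native file per call.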

@dsyme
Contributor Author

dsyme commented Sep 19, 2020

> @dsyme, I can add a switch that copies files to a single directory on resolution, sort of like publish lite. That would take care of your issues, and mean you don't have to deal with the size issue. It wouldn't be the default; it would be opt-in, so scripts normally wouldn't have to deal with the issues.

For my use cases the problem is that this doesn't deal with size: we would end up consuming 1-2GB of copying and storage on each .NET/F# Interactive invocation (when running on the GPU; less for the CPU Torch binaries), which is a significant pause time in itself. These native binaries are just vast (even if they don't all get paged in).

I think this is such a special case that we should just settle on where we are at the moment - with the spec above - and find some workaround for TorchSharp.

@KevinRansom
Member

OK

@KevinRansom
Member

So for now this is by design, and Don updated the RFC to note the transitive native dependency limitation.

cartermp modified the milestones: Backlog, 16.8 Sep 25, 2020
@fwaris

fwaris commented Nov 28, 2020

One workaround I have used in the past is to add the native DLLs to the "Path" environment variable (via System.Environment), dynamically in the script.

Here is a snippet I have used to reference native dlls for ML.Net in the past:

open System
let path = Environment.GetEnvironmentVariable("path")

// Verbatim strings (@"...") keep the backslashes literal; in a regular
// string, sequences such as "\n" and "\r" would be read as escapes.
let path' =
    path
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml\1.5.2\runtimes\win-x64\native\LdaNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0\CpuMathNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.fasttree\1.5.2\runtimes\win-x64\native\FastTreeNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.mkl.components\1.5.2\runtimes\win-x64\native\SymSgdNative.dll"
    + ";" + @"c:\users\admin\.nuget\packages\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native\MatrixFactorizationNative.dll"

Environment.SetEnvironmentVariable("path",path')

Note that adding the full path to the dll works fine. Maybe just adding directories would also be ok but I have not tested.

Also now with "nuget: ..." style references I don't have to do the above except for one dll that is under the "nativeassets\netstandard2.0" directory.

@KevinRansom
Member

@fwaris thanks, nativeassets was a new one on me. I will update probing to support it:

NuGet/Home#2782
NuGet/Home#3027 (comment)
NuGet/NuGet.Client@aed1d51

@fwaris

fwaris commented Dec 1, 2020

@KevinRansom FYI I just built a recommender model in ML.Net with FSI packaged with vs2019 preview. FSI could not find the MatrixFactorizationNative.dll. I had to add the directory to the "PATH" variable. Here is the script to load the packages and set the environment that worked for me:

#r "nuget: Microsoft.ML.AutoML, Version=0.17.2" 
#r "nuget: Microsoft.ML.Recommender"

let userProfile = System.Environment.GetEnvironmentVariable("UserProfile")
let packageRoot = $@"{userProfile}\.nuget\packages"
let nativeLib =  $@"{packageRoot}\microsoft.ml.cpumath\1.5.2\runtimes\win-x64\nativeassets\netstandard2.0" // contains CpuMathNative.dll
let nativeLib2 = $@"{packageRoot}\microsoft.ml.recommender\0.17.2\runtimes\win-x64\native" // contains MatrixFactorizationNative.dll
let path = System.Environment.GetEnvironmentVariable("path")
let path' =  path + ";" + nativeLib + ";" + nativeLib2
System.Environment.SetEnvironmentVariable("path",path')

@KevinRansom
Member

Thanks for the information.
