
Reduce string allocation from TextFactory #14531

Merged
merged 4 commits into dotnet:master on Oct 17, 2016

Conversation

heejaechang
Contributor

@asecchia Adrian found an issue where using the editor text buffer for a closed file allocates way more than using the compiler's implementation of SourceText.

Also, the default TextFactory implementation that OOP uses can benefit from this by not allocating strings as much as before for a big file.

@heejaechang
Contributor Author

@tmat Tomas, will this work?


chunks.Add(chunk);
private static ImmutableArray<char[]> ReadFromTextRead(TextReader reader, int maxCharRemainingGuess, bool throwIfBinaryDetected)
Member

ReadFromTextReader.

Contributor Author

will do

@tmat
Member

tmat commented Oct 14, 2016

@nguerrera worked in this area recently.

@@ -97,6 +97,37 @@ public static SourceText From(string text, Encoding encoding = null, SourceHashA
return new StringText(text, encoding, checksumAlgorithm: checksumAlgorithm);
}

#pragma warning disable RS0026 // Do not add multiple public overloads with optional parameters
Contributor

There are already some global suppressions for other overloads. Consider doing the same for this one.

Member

Why do we even have this warning if we suppress it every time?

Contributor Author

Will do. I just wanted a quick check from you guys on whether this is even okay to do.

Is it okay that the checksum and the embedded blob aren't there? I think I am using encoding wrong, though. Since I am getting a TextReader, the content should already be normalized to a certain encoding; the given encoding should just be passed along, like StringText.From(string, encoding), and not actually used anywhere else.

Contributor

I think TextReader is morally equivalent to string here: you get chars not bytes out of it. As such, it is OK to defer making the embedded blob and to do so by re-encoding it using the given encoding. It would be good to add a FromSource_TextReader test to EmbeddedTextTests. It should work as well as passing the same text to the string overload.


In reply to: 83501039 [](ancestors = 83501039)

Contributor

It's a good warning, but this class broke the rule already, so I'm suppressing to stay consistent. It is incredibly frustrating to have all of these overloads (which I observed while writing tests). For the EmbeddedText.FromXxx equivalents, the warning helped me to make separate FromBytes, FromStream, etc. factories. So I think the warning has value. Maybe we should just put a disable/restore pair around the full set of From overloads here and not suppress every time we add a new one. (Hopefully we won't add more, though.)


In reply to: 83500492 [](ancestors = 83500492)

@heejaechang changed the title from "[Review Only] reduce string allocation from TextFactory" to "Reduce string allocation from TextFactory" on Oct 14, 2016
@heejaechang
Contributor Author

@tmat @nguerrera @CyrusNajmabadi updated the PR to address all feedback and added unit tests

allow SourceText.From(TextReader) to have null encoding like the one from string.
@heejaechang
Contributor Author

@gafter I would like to add a SourceText.From overload which accepts a TextReader as the content input. This is to reduce various allocations of string, char[], etc.
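
For reviewers, a rough usage sketch of the proposed overload, based only on the public API signature added in this PR (the StreamReader setup and the length guess here are illustrative assumptions, not part of the change):

using System.IO;
using System.Text;
using Microsoft.CodeAnalysis.Text;

class TextReaderUsageSketch
{
    static SourceText ReadWithoutIntermediateString(string path)
    {
        // Previously a caller would do File.ReadAllText(path) and pass the whole
        // string to SourceText.From; for a big file that is one large allocation.
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            // 'length' is the expected character count; the file's byte length
            // is used here only as a rough guess for the sketch.
            var lengthGuess = (int)new FileInfo(path).Length;
            return SourceText.From(reader, lengthGuess, Encoding.UTF8);
        }
    }
}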

// We must compute the checksum and embedded text blob now while we still have the original bytes in hand.
// We cannot re-encode to obtain checksum and blob as the encoding is not guaranteed to round-trip.
var checksum = CalculateChecksum(stream, checksumAlgorithm);
var embeddedTextBlob = canBeEmbedded ? EmbeddedText.CreateBlob(stream) : default(ImmutableArray<byte>);
Member

What is the 'embeddedTextBlob'?

Contributor Author

@nguerrera is probably the better person to answer

Contributor

It is the blob that encodes (optionally) embedded source in a PDB. See #12625 for design info.
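
For context, a rough self-contained sketch (not part of this PR) of how an embedded source blob ends up in a PDB via EmbeddedText and Compilation.Emit; the file name and code here are hypothetical:

using System.IO;
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Emit;
using Microsoft.CodeAnalysis.Text;

class EmbeddedSourceSketch
{
    static void EmitWithEmbeddedSource()
    {
        var path = "program.cs"; // hypothetical file name
        var code = "class C { }";

        // An encoding is needed so the source can be re-encoded into the blob.
        var source = SourceText.From(code, Encoding.UTF8);
        var tree = CSharpSyntaxTree.ParseText(source, path: path);

        var compilation = CSharpCompilation.Create(
            "demo",
            new[] { tree },
            new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) },
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

        using (var peStream = new MemoryStream())
        using (var pdbStream = new MemoryStream())
        {
            // The 'embeddedTextBlob' discussed above is what gets written into this PDB.
            compilation.Emit(
                peStream,
                pdbStream,
                options: new EmitOptions(debugInformationFormat: DebugInformationFormat.PortablePdb),
                embeddedTexts: new[] { EmbeddedText.FromSource(path, source) });
        }
    }
}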

}
internal static SourceText Decode(TextReader reader, int length, Encoding encodingOpt, SourceHashAlgorithm checksumAlgorithm)
{
if (length == 0)
Member

Can you put a comment mentioning why this optimization is necessary? i.e., what percentage of readers have length == 0 and how much optimizing that case saves us. Thanks!

Member

I'm surprised we have any callers calling LargeText.Decode with a length of 0.

Contributor Author

@dotnet/roslyn-compiler — the compiler team is probably the one that should answer.
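
For readers following the thread: the guard under discussion is just an early-out before any chunk machinery runs. A minimal sketch of the pattern, assuming an empty input can simply be represented by the plain string-based SourceText (this is not the PR's exact code):

using System;
using System.IO;
using System.Text;
using Microsoft.CodeAnalysis.Text;

static class DecodeSketch
{
    internal static SourceText Decode(TextReader reader, int length, Encoding encodingOpt, SourceHashAlgorithm checksumAlgorithm)
    {
        if (length == 0)
        {
            // Nothing to read: skip allocating any chunk arrays or builders.
            return SourceText.From(string.Empty, encodingOpt, checksumAlgorithm);
        }

        // ... otherwise fall through to the chunked read path (elided in this sketch) ...
        throw new NotImplementedException();
    }
}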

{
Array.Resize(ref chunk, charsRead);
}
return new LargeText(chunks, encodingOpt, checksumAlgorithm);
Member

Why 'LargeText'? 'Large' implies that it is used for, well, large inputs. But what if the length was only '1', would we still make a "Large" text?

Or, if LargeText is for texts of any size, perhaps its name should be changed. Maybe ChunkedSourceText?

Member

Ah, n/m. This is LargeText.Decode, not SourceText.Decode. So the length check was already done before.

Contributor Author

I am following the existing decode pattern. @dotnet/roslyn-compiler are probably the better people to answer.

{
throw new InvalidDataException();
}
private static ImmutableArray<char[]> ReadFromTextReader(TextReader reader, int maxCharRemainingGuess, bool throwIfBinaryDetected)
Member

ReadChunksFromTextReader.

Contributor Author

sure

{
throw new InvalidDataException();
}
private static ImmutableArray<char[]> ReadFromTextReader(TextReader reader, int maxCharRemainingGuess, bool throwIfBinaryDetected)
Member

Should this return an ImmutableArray&lt;ImmutableArray&lt;char&gt;&gt;? Is there a value in returning mutable arrays? Are callers ever going to change them?

Contributor Author

Don't know; this is what it was doing before. But I'm not sure converting char[] to an immutable array is worth anything, since we need char[] to communicate with any other .NET API.
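
To illustrate the point about char[]: TextReader.Read only takes a mutable char[] buffer, so the chunks have to start life as mutable arrays anyway; wrapping them afterwards would just add a copy or a wrapper. A standalone sketch of the chunked read loop (not the PR's exact code; the chunk size is an illustrative assumption):

using System;
using System.Collections.Immutable;
using System.IO;

static class ChunkedReadSketch
{
    private const int ChunkSize = 40 * 1024; // illustrative chunk size

    public static ImmutableArray<char[]> ReadChunks(TextReader reader)
    {
        var chunks = ImmutableArray.CreateBuilder<char[]>();

        while (true)
        {
            // TextReader.Read requires a mutable char[]; an ImmutableArray<char>
            // could not be filled in place, we'd have to copy afterwards.
            var chunk = new char[ChunkSize];
            var charsRead = reader.Read(chunk, 0, chunk.Length);
            if (charsRead == 0)
            {
                break;
            }

            if (charsRead < chunk.Length)
            {
                Array.Resize(ref chunk, charsRead);
            }

            chunks.Add(chunk);
        }

        return chunks.ToImmutable();
    }
}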


if (charsRead < chunk.Length)
{
Array.Resize(ref chunk, charsRead);
Member

this resizing will not actually change 'chunk', it will allocate another array, copy the data into it, then point 'ref chunk' at that new array. Given that, do we want to pool chunks so that we don't leak them in the case where we need to resize?

Contributor Author

I don't believe there is any pool here for chunk.

Member

You could add one :)

Contributor Author

Only if there is data that shows it helps. But I doubt it would, since each chunk lives as long as the source text lives, the pool would stay statically alive in memory, and the chunk size would have to be big given that this is for large text.
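
For reference, Array.Resize never shrinks an array in place: it allocates a new array of the requested size, copies the elements, and repoints the ref, exactly as noted above. A tiny standalone check:

using System;

class ResizeDemo
{
    static void Main()
    {
        var chunk = new char[4096];
        var before = chunk;

        Array.Resize(ref chunk, 100);

        // The local now points at a fresh 100-element array; the original
        // 4096-element array becomes garbage unless something else still holds it.
        Console.WriteLine(ReferenceEquals(before, chunk)); // False
        Console.WriteLine(chunk.Length);                   // 100
    }
}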

@@ -337,6 +348,16 @@ public Task WriteStreamAsync(Stream stream, CancellationToken cancellationToken
}
}
}

internal class TemporaryStorageTextReader : TextReader
Member

There only appears to be one subclass of this type. Why not just roll this into DirectMemoryAccessReader?

Contributor Author

sure.
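
For anyone unfamiliar with what rolling the two types together would look like: it is just one TextReader subclass serving characters out of the in-memory buffer. A simplified, hypothetical managed-array version (the real reader works over natively allocated memory):

using System;
using System.IO;

// Simplified sketch: a TextReader over an in-memory char buffer.
// The actual DirectMemoryAccessReader reads from a native pointer instead.
internal sealed class CharBufferTextReader : TextReader
{
    private readonly char[] _buffer;
    private int _position;

    public CharBufferTextReader(char[] buffer)
    {
        _buffer = buffer ?? throw new ArgumentNullException(nameof(buffer));
    }

    public override int Peek()
        => _position < _buffer.Length ? _buffer[_position] : -1;

    public override int Read()
        => _position < _buffer.Length ? _buffer[_position++] : -1;

    public override int Read(char[] buffer, int index, int count)
    {
        var charsToCopy = Math.Min(count, _buffer.Length - _position);
        Array.Copy(_buffer, _position, buffer, index, charsToCopy);
        _position += charsToCopy;
        return charsToCopy;
    }
}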

@@ -813,6 +813,7 @@ static Microsoft.CodeAnalysis.SeparatedSyntaxList<TNode>.implicit operator Micro
static Microsoft.CodeAnalysis.SeparatedSyntaxList<TNode>.implicit operator Microsoft.CodeAnalysis.SeparatedSyntaxList<TNode>(Microsoft.CodeAnalysis.SeparatedSyntaxList<Microsoft.CodeAnalysis.SyntaxNode> nodes) -> Microsoft.CodeAnalysis.SeparatedSyntaxList<TNode>
static Microsoft.CodeAnalysis.Text.SourceText.From(System.IO.Stream stream, System.Text.Encoding encoding = null, Microsoft.CodeAnalysis.Text.SourceHashAlgorithm checksumAlgorithm = Microsoft.CodeAnalysis.Text.SourceHashAlgorithm.Sha1, bool throwIfBinaryDetected = false, bool canBeEmbedded = false) -> Microsoft.CodeAnalysis.Text.SourceText
static Microsoft.CodeAnalysis.Text.SourceText.From(System.IO.Stream stream, System.Text.Encoding encoding, Microsoft.CodeAnalysis.Text.SourceHashAlgorithm checksumAlgorithm, bool throwIfBinaryDetected) -> Microsoft.CodeAnalysis.Text.SourceText
static Microsoft.CodeAnalysis.Text.SourceText.From(System.IO.TextReader reader, int length, System.Text.Encoding encoding = null, Microsoft.CodeAnalysis.Text.SourceHashAlgorithm checksumAlgorithm = Microsoft.CodeAnalysis.Text.SourceHashAlgorithm.Sha1) -> Microsoft.CodeAnalysis.Text.SourceText
Contributor

CC @AnthonyDGreen, @jaredpar, @gafter for the new public API.

@AlekseyTs
Contributor

LGTM

1 similar comment
@nguerrera
Contributor

LGTM

@heejaechang merged commit 779bc9a into dotnet:master on Oct 17, 2016
heejaechang added a commit to heejaechang/roslyn that referenced this pull request Oct 21, 2016
Reduce string allocation from TextFactory