Documentation page on collations and case-sensitivity
Closes #2273
roji committed Apr 27, 2020
1 parent 804349c commit f493c2f
Showing 5 changed files with 150 additions and 0 deletions.
@@ -0,0 +1,62 @@
---
title: Collations and case sensitivity - EF Core
author: roji
ms.date: 04/27/2020
ms.assetid: bde4e0ee-fba3-4813-a849-27049323d301
uid: core/miscellaneous/collations-and-case-sensitivity.md
---
# Collations and Case Sensitivity

> [!NOTE]
> The APIs shown in this page are being introduced in EF Core 5.0, which is still in preview.

Text processing in databases can be complex, and requires more user attention than one would suspect. For one thing, databases vary considerably in how they handle text; for example, while some databases are case-sensitive by default (e.g. SQLite, PostgreSQL), others are case-insensitive (SQL Server, MySQL). In addition, because of index use, case-sensitivity and similar aspects can have a far-reaching impact on query performance: while it may be tempting to use `string.ToLower` to force a case-insensitive comparison in a case-sensitive database, doing so may prevent your application from using indexes. This page details how to configure case sensitivity, or more generally, collations, and how to do so in an efficient way without compromising query performance.

## Introduction to collations

A fundamental concept in text processing is the *collation*, which is a set of rules determining how text values are ordered and compared for equality. For example, while a case-insensitive collation disregards differences between upper- and lower-case letters for the purposes of equality comparison, a case-sensitive collation does not. However, case-sensitivity is culture-sensitive: while `i` and `I` are typically lower- and upper-case versions of the same letter, in the Turkish language they are two different letters; as a result, there exist many different case-insensitive collations, each with its own set of rules. The scope of collations also extends beyond case-sensitivity, to other aspects of character data; in German, for example, it is sometimes (but not always) desirable to treat `ä` and `ae` as identical. Finally, collations also define how text values are *ordered*: while German places `ä` after `a`, Swedish places it at the end of the alphabet.
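
The same ideas can be observed client-side in .NET, whose culture-aware comparison APIs play a role loosely similar to database collations. The following is a minimal sketch and not part of the EF Core sample: `CollationConcepts` and its `Compare` helper are hypothetical names, and the results depend on the globalization data (ICU or NLS) available on the machine, so no output is shown.

```csharp
using System;
using System.Globalization;

public static class CollationConcepts
{
    // Compares two strings using the rules of the named culture, optionally ignoring case.
    // Returns a negative number, zero, or a positive number, as an ordering rule would.
    public static int Compare(string a, string b, string cultureName, bool ignoreCase)
    {
        var compareInfo = CultureInfo.GetCultureInfo(cultureName).CompareInfo;
        var options = ignoreCase ? CompareOptions.IgnoreCase : CompareOptions.None;
        return compareInfo.Compare(a, b, options);
    }

    public static void Main()
    {
        // Case-insensitive equality is culture-dependent: English rules pair "i" with "I",
        // whereas Turkish rules treat them as two different letters.
        Console.WriteLine(Compare("i", "I", "en-US", ignoreCase: true) == 0);
        Console.WriteLine(Compare("i", "I", "tr-TR", ignoreCase: true) == 0);

        // Ordering is culture-dependent as well: German rules sort "ä" next to "a",
        // whereas Swedish rules sort it after "z".
        Console.WriteLine(Compare("ä", "z", "de-DE", ignoreCase: false) < 0);
        Console.WriteLine(Compare("ä", "z", "sv-SE", ignoreCase: false) < 0);
    }
}
```

Database collations are configured and named differently from .NET cultures, so this only illustrates the concept; it says nothing about how a particular database compares the same strings.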

All text operations in a database use a collation - whether explicitly or implicitly - which defines how the operation compares and orders strings. The actual list of available collations as well as their management is database-specific, as are their names; consult [the section below](#database-specific-information) for links to relevant documentation pages of various databases. Fortunately, databases are quite uniform in allowing a default collation to be defined at the database or column level, and in allowing a collation to be specified explicitly for specific operations in a query.

## Database collation

In most database systems, a default collation is defined at the database level; unless overridden, that collation implicitly applies to all text operations occurring within that database. The database collation is typically set at database creation time (via the `CREATE DATABASE` DDL statement), and if not specified, defaults to a server-level value determined at setup time. For example, the default server-level collation in SQL Server for the "English (United States)" machine locale is `SQL_Latin1_General_CP1_CI_AS`, which is a case-insensitive but accent-sensitive collation. Although database systems usually do allow altering the collation of an existing database, doing so frequently leads to complications; it is recommended to pick a collation before database creation.

EF Core allows you to specify the database collation. For example, the following configures a SQL Server database to use a case-sensitive collation:

[!code-csharp[Main](../../../samples/core/Miscellaneous/Collations/Program.cs?name=OnModelCreating&highlight=3)]

## Column collation

Collations can also be defined on text columns, overriding the database default. This can be useful if, for example, certain columns need to be case-insensitive, while the rest of the database needs to be case-sensitive.

The following configures the column for the `Name` property with its own collation, independently of the database default:

[!code-csharp[Main](../../../samples/core/Miscellaneous/Collations/Program.cs?name=OnModelCreating)]

## Explicit collation in a query

In more extreme cases, the same column needs to be queried using different collations by different queries. For example, one query may need to perform a case-sensitive comparison on a column, while another may need to perform a case-insensitive comparison on the same column. This can be accomplished by explicitly specifying a collation within the query itself:

[!code-csharp[Main](../../../samples/core/Miscellaneous/Collations/Program.cs?name=SimpleQueryCollation)]

This generates a `COLLATE` clause in the SQL query, which forces it to use a case-sensitive collation regardless of the collation defined at the column or database level.
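
One way to check this is to print the SQL that EF Core generates for the query. The sketch below is an assumption-laden illustration rather than part of the sample: the `CollationSketch` namespace and its types are hypothetical stand-ins that mirror the sample's shape, and it assumes the `ToQueryString` extension method is available in the EF Core 5.0 package version in use. The exact SQL text can vary between versions, so no output is shown.

```csharp
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

namespace CollationSketch
{
    // Hypothetical entity and context, mirroring the shape of the sample's types.
    public class Customer
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public class CustomerContext : DbContext
    {
        public DbSet<Customer> Customers { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
            => optionsBuilder.UseSqlServer(
                @"Server=(localdb)\mssqllocaldb;Database=EFCollations;Trusted_Connection=True");
    }

    public static class Program
    {
        public static void Main()
        {
            using (var context = new CustomerContext())
            {
                // Same query shape as the sample: compare Name under an explicit collation.
                var query = context.Customers
                    .Where(c => EF.Functions.Collate(c.Name, "SQL_Latin1_General_CP1_CS_AS") == "John");

                // ToQueryString returns the SQL for the query without executing it.
                Console.WriteLine(query.ToQueryString());
            }
        }
    }
}
```

Looking for the `COLLATE` clause in the printed SQL is a quick sanity check before examining the query plan on the database side.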

### Explicit collations and indexes

Indexes are one of the most important factors in database performance - a query that runs efficiently with an index can grind to a halt without that index. Indexes implicitly inherit the collation of their column; this means that all queries on the column are automatically eligible to use indexes defined on that column - provided that the query doesn't specify a different collation. Specifying an explicit collation in a query will generally prevent that query from using an index defined on that column, since the collations would no longer match - it is therefore recommended to exercise caution when using this feature. It is always preferable to define the collation at the column (or database) level, allowing all queries to implicitly use that collation and benefit from any index.

Note that some databases (e.g. PostgreSQL, SQLite) allow the collation to be defined when creating an index. This allows multiple indexes to be defined on the same column, speeding up operations with multiple collations (e.g. both case-sensitive and case-insensitive comparisons). Consult your database provider's documentation for more details.

> [!WARNING]
> Always inspect the query plans of your queries, and make sure indexes are being used in performance-critical queries executing over large amounts of data. While it may be tempting to override case-sensitivity in a query via `EF.Functions.Collate` (or by calling `string.ToLower`), this can have a very significant impact on your application's performance.

## Translation of built-in .NET string operations

In .NET, string equality is case-sensitive by default: `s1 == s2` performs an ordinal comparison that requires the strings to be identical. Because the default collation of databases varies, and because it is desirable for simple equality to use indexes, EF Core makes no attempt to translate simple equality to a database case-sensitive operation: a C# equality expression is translated to a SQL equality expression, which may or may not be case-sensitive, depending on the specific database in use (and its collation configuration).

In addition, .NET provides overloads of [`string.Equals`](https://docs.microsoft.com/en-us/dotnet/api/system.string.equals?view=netcore-3.1#System_String_Equals_System_String_System_StringComparison_) accepting a [`StringComparison`](https://docs.microsoft.com/dotnet/api/system.stringcomparison) enum, which allows specifying case-sensitivity and culture for the comparison. By design, EF Core refrains from translating these overloads to SQL, and attempting to use them will result in an exception. For one thing, EF Core does not know which case-sensitive or case-insensitive collation should be used. More importantly, forcing the comparison to use a certain collation would in most cases prevent index usage, significantly impacting performance for a very basic and commonly-used .NET construct. To force a query to use case-sensitive or case-insensitive comparison, specify a collation explicitly via `EF.Functions.Collate` as [detailed above](#explicit-collation-in-a-query).
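
For reference, the in-memory behavior of the two .NET constructs discussed in this section can be sketched as follows. This shows only what .NET does on the client; it does not show how EF Core translates either construct, and `EqualityDemo` and its methods are hypothetical names used for illustration.

```csharp
using System;

public static class EqualityDemo
{
    // The == operator on strings performs an ordinal, case-sensitive comparison.
    public static bool OperatorEquals(string a, string b) => a == b;

    // The StringComparison overload makes case-insensitivity explicit.
    public static bool IgnoreCaseEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        Console.WriteLine(OperatorEquals("John", "john"));   // False
        Console.WriteLine(IgnoreCaseEquals("John", "john")); // True
    }
}
```

Inside a LINQ query, the first form becomes a plain SQL equality whose case-sensitivity is decided by the database collation, while the second form is the one EF Core declines to translate.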

## Database-specific information

* [SQL Server documentation on collations](https://docs.microsoft.com/sql/relational-databases/collations/collation-and-unicode-support)
2 changes: 2 additions & 0 deletions entity-framework/toc.yml
@@ -84,6 +84,8 @@
href: core/miscellaneous/configuring-dbcontext.md
- name: Nullable reference types
href: core/miscellaneous/nullable-reference-types.md
- name: Collations and case sensitivity
href: core/miscellaneous/collations-and-case-sensitivity.md
- name: Create a model
items:
- name: Overview
14 changes: 14 additions & 0 deletions samples/core/Miscellaneous/Collations/Collations.csproj
@@ -0,0 +1,14 @@
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
    <RootNamespace>EFCollations</RootNamespace>
    <AssemblyName>EFCollations</AssemblyName>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="Microsoft.EntityFrameworkCore.SqlServer" Version="5.0.0-preview.3.20181.2" />
    <PackageReference Include="Microsoft.Extensions.Logging.Console" Version="5.0.0-preview.3.20181.2" />
  </ItemGroup>

</Project>
53 changes: 53 additions & 0 deletions samples/core/Miscellaneous/Collations/Program.cs
@@ -0,0 +1,53 @@
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

namespace EFCollations
{
    public class Program
    {
        static void Main(string[] args)
        {
            // Recreate the database so that the collations configured in the model are applied.
            using (var db = new CustomerContext())
            {
                db.Database.EnsureDeleted();
                db.Database.EnsureCreated();
            }

            using (var context = new CustomerContext())
            {
                #region SimpleQueryCollation
                var customers = context.Customers
                    .Where(c => EF.Functions.Collate(c.Name, "SQL_Latin1_General_CP1_CS_AS") == "John")
                    .ToList();
                #endregion
            }
        }
    }

    public class CustomerContext : DbContext
    {
        public DbSet<Customer> Customers { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        {
            optionsBuilder.UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=EFCollations;Trusted_Connection=True;ConnectRetryCount=0");
        }

        #region OnModelCreating
        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            modelBuilder.UseCollation("SQL_Latin1_General_CP1_CS_AS");

            modelBuilder.Entity<Customer>().Property(c => c.Name)
                .UseCollation("SQL_Latin1_General_CP1_CI_AS");
        }
        #endregion
    }

    public class Customer
    {
        // Named Id so that EF Core discovers it as the primary key by convention.
        public int Id { get; set; }
        public string Name { get; set; }
    }
}
19 changes: 19 additions & 0 deletions samples/core/Samples.sln
@@ -51,6 +51,8 @@ Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "SqlServer", "SqlServer\SqlS
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "ValueConversions", "Modeling\ValueConversions\ValueConversions.csproj", "{FE71504E-C32B-4E2F-9830-21ED448DABC4}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Collations", "Miscellaneous\Collations\Collations.csproj", "{62C86664-49F4-4C59-A2EC-1D70D85149D9}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
@@ -375,6 +377,22 @@ Global
{FE71504E-C32B-4E2F-9830-21ED448DABC4}.Release|x64.Build.0 = Release|Any CPU
{FE71504E-C32B-4E2F-9830-21ED448DABC4}.Release|x86.ActiveCfg = Release|Any CPU
{FE71504E-C32B-4E2F-9830-21ED448DABC4}.Release|x86.Build.0 = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|Any CPU.Build.0 = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|ARM.ActiveCfg = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|ARM.Build.0 = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|x64.ActiveCfg = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|x64.Build.0 = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|x86.ActiveCfg = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Debug|x86.Build.0 = Debug|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|Any CPU.ActiveCfg = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|Any CPU.Build.0 = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|ARM.ActiveCfg = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|ARM.Build.0 = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|x64.ActiveCfg = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|x64.Build.0 = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|x86.ActiveCfg = Release|Any CPU
{62C86664-49F4-4C59-A2EC-1D70D85149D9}.Release|x86.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
@@ -393,6 +411,7 @@ Global
{802E31AD-2F1E-41A1-A662-5929E2626601} = {CA5046EC-C894-4535-8190-A31F75FDEB96}
{63685B9A-1233-4B44-AAC1-8DDD4B16B65D} = {CA5046EC-C894-4535-8190-A31F75FDEB96}
{FE71504E-C32B-4E2F-9830-21ED448DABC4} = {CA5046EC-C894-4535-8190-A31F75FDEB96}
{62C86664-49F4-4C59-A2EC-1D70D85149D9} = {85AFD7F1-6943-40FE-B8EC-AA9DBB42CCA6}
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {20C98D35-54EF-46A6-8F3B-1855C1AE4F70}