Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-sentence title formatting to sentence case and title case #6872

Merged
merged 7 commits into from
Oct 3, 2020
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Use regular expression to split multiple-sentence titles

## Context and Problem Statement

Some entry titles are composed of multiple sentences, for example: "Whose Music? A Sociology of Musical Language", therefore, it is necessary to first split the title into sentences and process them individually to ensure proper formatting using '[Sentence Case](https://en.wiktionary.org/wiki/sentence_case)' or '[Title Case](https://en.wiktionary.org/wiki/title_case#English)'

## Considered Options

* [Regular expression](https://docs.oracle.com/javase/tutorial/essential/regex/)
* [OpenNLP](https://opennlp.apache.org/)
* [ICU4J](http://site.icu-project.org/home)

## Decision Outcome

Chosen option: "Regular expression", because we can use Java internal classes (Pattern, Matcher) instead of adding additional dependencies

### Positive Consequences

* Less dependencies on third party libraries
* Smaller project size (ICU4J is very large)
* No need for model data (OpenNLP is a machine learning based toolkit and needs a trained model to work properly)

### Negative Consequences

* Regular expressions can never cover every case, therefore, splitting may not be accurate for every title

Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
package org.jabref.logic.formatter.casechanger;

import java.util.stream.Collectors;

import org.jabref.logic.cleanup.Formatter;
import org.jabref.logic.l10n.Localization;
import org.jabref.model.strings.StringUtil;

public class SentenceCaseFormatter extends Formatter {

Expand All @@ -20,11 +23,16 @@ public String getKey() {
*/
@Override
public String format(String input) {
Title title = new Title(new LowerCaseFormatter().format(input));

title.getWords().stream().findFirst().ifPresent(Word::toUpperFirst);

return title.toString();
return StringUtil.getStringAsSentences(input)
.stream()
.map(new LowerCaseFormatter()::format)
.map(Title::new)
.map(title -> {
title.getFirstWord().ifPresent(Word::toUpperFirst);
return title;
})
.map(Object::toString)
.collect(Collectors.joining(" "));
}

@Override
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
package org.jabref.logic.formatter.casechanger;

import java.util.stream.Collectors;

import org.jabref.logic.cleanup.Formatter;
import org.jabref.logic.l10n.Localization;
import org.jabref.model.strings.StringUtil;

public class TitleCaseFormatter extends Formatter {

Expand All @@ -22,21 +25,26 @@ public String getKey() {
*/
@Override
public String format(String input) {
Title title = new Title(input);

title.getWords().stream().filter(Word::isSmallerWord).forEach(Word::toLowerCase);
title.getWords().stream().filter(Word::isLargerWord).forEach(Word::toUpperFirst);

title.getFirstWord().ifPresent(Word::toUpperFirst);
title.getLastWord().ifPresent(Word::toUpperFirst);

for (int i = 0; i < (title.getWords().size() - 2); i++) {
if (title.getWords().get(i).endsWithColon()) {
title.getWords().get(i + 1).toUpperFirst();
}
}

return title.toString();
return StringUtil.getStringAsSentences(input)
.stream()
.map(sentence -> {
Title title = new Title(sentence);

title.getWords().stream().filter(Word::isSmallerWord).forEach(Word::toLowerCase);
title.getWords().stream().filter(Word::isLargerWord).forEach(Word::toUpperFirst);

title.getFirstWord().ifPresent(Word::toUpperFirst);
title.getLastWord().ifPresent(Word::toUpperFirst);

for (int i = 0; i < (title.getWords().size() - 2); i++) {
if (title.getWords().get(i).endsWithColon()) {
title.getWords().get(i + 1).toUpperFirst();
}
}

return title.toString();
})
.collect(Collectors.joining(" "));
}

@Override
Expand Down
9 changes: 9 additions & 0 deletions src/main/java/org/jabref/model/strings/StringUtil.java
Original file line number Diff line number Diff line change
Expand Up @@ -720,6 +720,15 @@ public static List<String> getStringAsWords(String text) {
return Arrays.asList(text.split("[\\s,;]+"));
}

/**
* Returns a list of sentences contained in the given text.
*/
public static List<String> getStringAsSentences(String text) {
// A sentence ends with a .?!;, but not in the case of "Mr.", "Ms.", "Mrs.", "Dr.", "st.", "jr.", "co.", "inc.", and "ltd."
Pattern splitTextPattern = Pattern.compile("(?<=[\\.!;\\?])(?<![Mm](([Rr]|[Rr][Ss])|[Ss])\\.|[Dd][Rr]\\.|[Ss][Tt]\\.|[Jj][Rr]\\.|[Cc][Oo]\\.|[Ii][Nn][Cc]\\.|[Ll][Tt][Dd]\\.)\\s+");
return Arrays.asList(splitTextPattern.split(text));
}

@ApacheCommonsLang3Allowed("No direct Guava equivalent existing - see https://stackoverflow.com/q/16560635/873282")
public static boolean containsIgnoreCase(String text, String searchString) {
return StringUtils.containsIgnoreCase(text, searchString);
Expand Down
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
package org.jabref.logic.formatter.casechanger;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import java.util.stream.Stream;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import static org.junit.jupiter.api.Assertions.assertEquals;

Expand All @@ -10,23 +13,30 @@
*/
public class SentenceCaseFormatterTest {

private SentenceCaseFormatter formatter;

@BeforeEach
public void setUp() {
formatter = new SentenceCaseFormatter();
}
private final SentenceCaseFormatter formatter = new SentenceCaseFormatter();

@Test
public void test() {
assertEquals("Upper first", formatter.format("upper First"));
assertEquals("Upper first", formatter.format("uPPER FIRST"));
assertEquals("Upper {NOT} first", formatter.format("upper {NOT} FIRST"));
assertEquals("Upper {N}ot first", formatter.format("upper {N}OT FIRST"));
private static Stream<Arguments> testData() {
return Stream.of(
Arguments.of("Upper first", "upper First"),
Arguments.of("Upper first", "uPPER FIRST"),
Arguments.of("Upper {NOT} first", "upper {NOT} FIRST"),
Arguments.of("Upper {N}ot first", "upper {N}OT FIRST"),
Arguments.of("Whose music? A sociology of musical language",
"Whose music? a sociology of musical language"),
Arguments.of("Bibliographic software. A comparison.",
"bibliographic software. a comparison."),
Arguments.of("England’s monitor; The history of the separation",
"England’s Monitor; the History of the Separation"),
Arguments.of("Dr. schultz: a dentist turned bounty hunter.",
"Dr. schultz: a dentist turned bounty hunter."),
Arguments.of("Example case. {EXCLUDED SENTENCE.}",
"Example case. {EXCLUDED SENTENCE.}"),
Arguments.of("I have {Aa} dream", new SentenceCaseFormatter().getExampleInput()));
}

@Test
public void formatExample() {
assertEquals("I have {Aa} dream", formatter.format(formatter.getExampleInput()));
@ParameterizedTest
@MethodSource("testData")
public void test(String expected, String input) {
assertEquals(expected, formatter.format(input));
}
}
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
package org.jabref.logic.formatter.casechanger;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import java.util.stream.Stream;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import static org.junit.jupiter.api.Assertions.assertEquals;

Expand All @@ -10,8 +13,43 @@
*/
public class TitleCaseFormatterTest {

private TitleCaseFormatter formatter;
private final TitleCaseFormatter formatter = new TitleCaseFormatter();

private static Stream<Arguments> testData() {
return Stream.of(
Arguments.of("Upper Each First", "upper each first"),
Arguments.of("Upper Each First", "upper eACH first"),
Arguments.of("An Upper Each First And", "an upper each first and"),
Arguments.of("An Upper Each First And", "an upper each first AND"),
Arguments.of("An Upper Each of the and First And",
"an upper each of the and first and"),
Arguments.of("An Upper Each of the and First And",
"an upper each of the AND first and"),
Arguments.of("An Upper Each of: The and First And",
"an upper each of: the and first and"),
Arguments.of("An Upper First with and without {CURLY} {brackets}",
"AN UPPER FIRST WITH AND WITHOUT {CURLY} {brackets}"),
Arguments.of("An Upper First with {A}nd without {C}urly {b}rackets",
"AN UPPER FIRST WITH {A}ND WITHOUT {C}URLY {b}rackets"),
Arguments.of("{b}rackets {b}rac{K}ets Brack{E}ts",
"{b}RaCKeTS {b}RaC{K}eTS bRaCK{E}ts"),
Arguments.of("Two Experiences Designing for Effective Security",
"Two experiences designing for effective security"),
Arguments.of("Bibliographic Software. A Comparison.",
"bibliographic software. a comparison."),
Arguments.of("Bibliographic Software. {A COMPARISON.}",
"bibliographic software. {A COMPARISON.}"),
Arguments.of("{BPMN} Conformance in Open Source Engines",
new TitleCaseFormatter().getExampleInput()));
}

@ParameterizedTest
@MethodSource("testData")
public void test(String expected, String input) {
assertEquals(expected, formatter.format(input));
}

/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these tests are commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are redundant as I already added them as parameterized tests but I didn't know if I should delete them yet since they contain the descriptions of the tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then just remove them. I think it's clear from the arguments what is tested

@BeforeEach
public void setUp() {
formatter = new TitleCaseFormatter();
Expand Down Expand Up @@ -80,8 +118,21 @@ public void testTwoExperiencesTitle() {
formatter.format("Two experiences designing for effective security"));
}

@Test
public void testSimpleTwoSentenceTitle() {
assertEquals("Bibliographic Software. A Comparison.",
formatter.format("bibliographic software. a comparison."));
}

@Test
public void secondSentenceInBracketsIsLeftUnchanged() {
assertEquals("Bibliographic Software. {A COMPARISON.}",
formatter.format("bibliographic software. {A COMPARISON.}"));
}

@Test
public void formatExample() {
assertEquals("{BPMN} Conformance in Open Source Engines", formatter.format(formatter.getExampleInput()));
}
*/
}