Classify tutorial #1120

Merged 11 commits on Sep 19, 2017

Changes from 5 commits
80 changes: 80 additions & 0 deletions language/classify_text/README.md
@@ -0,0 +1,80 @@
# Introduction

Contributor: This should (ideally) use an autogenerated readme. This readme has way too much content and possibly duplicates the tutorial on devsite.

Member Author: done.

This sample contains the code referenced in the
[Text Classification Tutorial](http://cloud.google.com/natural-language/docs/classify-text-tutorial) in the Google Cloud Natural Language API documentation. A full walkthrough of this sample is available in that documentation.

This sample shows how to use the text classification feature of the Natural Language API to find texts that are similar to a given query.

## Prerequisites

Set up your
[Cloud Natural Language API project](https://cloud.google.com/natural-language/docs/getting-started#set_up_a_project),
which includes:

* Enabling the Natural Language API
* Setting up a service account
* Setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable so that the sample can authenticate to the service (see the example below)
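
For example, on Linux or macOS you can point the variable at your downloaded service account key (the path below is a placeholder):

```
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json
```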

## Download the Code

```
$ git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
$ cd python-docs-samples/language/classify_text
```

## Run the Code

Create a virtualenv, activate it, and install the dependencies:

```
$ virtualenv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
```

### Usage

This sample is organized as a script runnable from the command line. It can perform the following tasks:

* Classifies multiple text files and writes the results to an "index" file.
* Processes an input query text to find similar text files.
* Processes an input query category label to find similar text files.

## Classify text

```
python classify_text_tutorial.py classify "$(cat resources/query_text.txt)"
```

Note that the text needs to be sufficiently long (at least 20 tokens) for the API to return a non-empty response.
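
For example, you can try classifying one of the sample texts bundled under `resources/texts` (a usage sketch):

```
python classify_text_tutorial.py classify "$(cat resources/texts/android.txt)"
```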

## Index multiple text files

```
python classify_text_tutorial.py index resources/texts
```

By default this writes the index to a file named `index.json`; you can choose a different filename with the optional `--index_file` argument.
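
For example, to write the index to a different file (a sketch; `my_index.json` is an arbitrary name), pass the `--index_file` flag:

```
python classify_text_tutorial.py index resources/texts --index_file my_index.json
```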

## Query with a category label

The indexed text files can be queried with any of the category labels listed on the [Categories](https://cloud.google.com/natural-language/docs/categories) page.

```
python classify_text_tutorial.py query-category index.json "/Internet & Telecom/Mobile & Wireless"
```

## Query with text

The indexed text files can also be queried with new text that has not itself been indexed.

```
python classify_text_tutorial.py query index.json "$(cat resources/query_text1.txt)"
```





251 changes: 251 additions & 0 deletions language/classify_text/classify_text_tutorial.py
@@ -0,0 +1,251 @@
# Copyright 2017, Google, Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START classify_text_tutorial]
"""Using the classify_text method to cluster texts."""

Contributor: This docstring needs to be far more descriptive.

Member Author: done. but let me know if I should add more. the tutorial page on cloud.google.com hasn't been published yet.

# [START classify_text_tutorial_import]
import argparse
import json
import os

from google.cloud import language_v1beta2
from google.cloud.language_v1beta2 import enums

Contributor: you can just use language_v1beta2.types and language_v1beta2.enums if you want to save yourself the trouble of importing.

Member Author: In order to reduce clutter in other samples (e.g. prefer types.Document over language_v1beta2.Document) I have been importing all three separately. I hope to keep it consistent in this code sample as well.

from google.cloud.language_v1beta2 import types

import numpy as np

Contributor: please do not alias imports. This also goes into the second section of imports.

Member Author: Done.

However, aliasing numpy as np seems very common for Python. Later let's revisit the possibility of relaxing the authoring guide to allow this?

# [END classify_text_tutorial_import]


# [START def_classify]
def classify(text, verbose=True):
    """Classify the input text into categories."""

    language_client = language_v1beta2.LanguageServiceClient()

    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)
    categories = language_client.classify_text(document).categories

Contributor: For better understandability, please assign the result to a temporary variable:

    result = language_client.classify_text(document)
    categories = result.categories

Member Author: done.

    result = {}

    for category in categories:
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence

    if verbose:
        print(text)
        for category in categories:
            print(u'=' * 20)
            print(u'{:<16}: {}'.format('category', category.name))
            print(u'{:<16}: {}'.format('confidence', category.confidence))

    return result
# [END def_classify]


# [START def_index]
def index(path, index_file):
    """Classify each text file in the directory and write
    the results to the index_file.
    """

    result = {}
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if not os.path.isfile(file_path):
            continue

        try:
            with open(file_path, 'r') as f:
                text = f.read()
                categories = classify(text, verbose=False)

                result[filename] = categories
        except Exception:
            print('Failed to process {}'.format(file_path))

    with open(index_file, 'w') as f:
        json.dump(result, f)

    print('Texts indexed in file: {}'.format(index_file))
    return result
# [END def_index]


# [START def_split_labels]
def split_labels(categories):
    """The category labels are of the form "/a/b/c" up to three levels,
    for example "/Computers & Electronics/Software", and these labels
    are used as keys in the categories dictionary, whose values are
    confidence scores.

    The split_labels function splits the keys into individual levels
    while duplicating the confidence score, which allows a natural
    boost in how we calculate similarity when more levels are in common.

    Example:
        If we have

        x = {"/a/b/c": 0.5}
        y = {"/a/b": 0.5}
        z = {"/a": 0.5}

        Then x and y are considered more similar than y and z.
    """
    _categories = {}
    for name, confidence in categories.iteritems():
        labels = [label for label in name.split('/') if label]
        for label in labels:
            _categories[label] = confidence

    return _categories
# [END def_split_labels]
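# A quick illustration of the docstring above (an illustrative sketch):
#   split_labels({'/Computers & Electronics/Software': 0.6})
#   returns {'Computers & Electronics': 0.6, 'Software': 0.6},
# so texts that share more label levels accumulate a larger dot product
# in the similarity computation below.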


# [START def_similarity]
def similarity(categories1, categories2):
    """Cosine similarity of the categories treated as sparse vectors."""
    categories1 = split_labels(categories1)
    categories2 = split_labels(categories2)

    norm1 = np.linalg.norm(categories1.values())
    norm2 = np.linalg.norm(categories2.values())

    # Return the smallest possible similarity if either categories is empty.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    # Compute the cosine similarity.
    dot = 0.0
    for label, confidence in categories1.iteritems():
        dot += confidence * categories2.get(label, 0.0)

    return dot / (norm1 * norm2)
# [END def_similarity]
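# Worked example (a sketch): for categories1 = {'/a/b': 1.0} and
# categories2 = {'/a/b/c': 1.0}, split_labels yields {'a': 1.0, 'b': 1.0}
# and {'a': 1.0, 'b': 1.0, 'c': 1.0}. The dot product is 2.0 and the norms
# are sqrt(2) and sqrt(3), so the similarity is 2 / sqrt(6), roughly 0.816.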


# [START def_query]
def query(index_file, text, n_top=3):
    """Find the indexed files that are the most similar to
    the query text.
    """

    with open(index_file, 'r') as f:

Contributor: Throughout this file, please use io.open over open.

Member Author: done.

        index = json.load(f)

    # Get the categories of the query text.
    query_categories = classify(text, verbose=False)

    similarities = []
    for filename, categories in index.iteritems():

Contributor: You'll need to use six.iteritems(index) for this to work on 2 & 3.

Member Author: done.

        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(text))
    for category, confidence in query_categories.iteritems():
        print('\tCategory: {}, confidence: {}'.format(category, confidence))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query]
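# Sketch of the portability changes requested in the review above (not yet
# applied in this revision of the diff): read the index with io.open() and
# iterate with six.iteritems() so the loop works on both Python 2 and 3:
#
#     import io
#     import six
#
#     with io.open(index_file, 'r') as f:
#         index = json.load(f)
#     for filename, categories in six.iteritems(index):
#         ...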


# [START def_query_category]
def query_category(index_file, category_string, n_top=3):
    """Find the indexed files that are the most similar to
    the query label.

    The list of all available labels:
    https://cloud.google.com/natural-language/docs/categories
    """

    with open(index_file, 'r') as f:
        index = json.load(f)

    # Make the category_string into a dictionary so that it is
    # of the same format as what we get by calling classify.
    query_categories = {category_string: 1.0}

    similarities = []
    for filename, categories in index.iteritems():
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(category_string))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query_category]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    subparsers = parser.add_subparsers(dest='command')
    classify_parser = subparsers.add_parser(
        'classify', help=classify.__doc__)
    classify_parser.add_argument(
        'text', help='The text to be classified. '
                     'The text needs to have at least 20 tokens.')
    index_parser = subparsers.add_parser(
        'index', help=index.__doc__)
    index_parser.add_argument(
        'path', help='The directory that contains '
                     'text files to be indexed.')
    index_parser.add_argument(
        '--index_file', help='Filename for the output JSON.',
        default='index.json')
    query_parser = subparsers.add_parser(
        'query', help=query.__doc__)
    query_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_parser.add_argument(
        'text', help='Query text.')
    query_category_parser = subparsers.add_parser(
        'query-category', help=query_category.__doc__)
    query_category_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_category_parser.add_argument(
        'category', help='Query category.')

    args = parser.parse_args()

    if args.command == 'classify':
        classify(args.text)
    if args.command == 'index':
        index(args.path, args.index_file)
    if args.command == 'query':
        query(args.index_file, args.text)
    if args.command == 'query-category':
        query_category(args.index_file, args.category)
# [END classify_text_tutorial]
45 changes: 45 additions & 0 deletions language/classify_text/classify_text_tutorial_test.py
@@ -0,0 +1,45 @@
# Copyright 2016, Google, Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from classify_text_tutorial import classify
from classify_text_tutorial import similarity
from classify_text_tutorial import split_labels

RESOURCES = os.path.join(os.path.dirname(__file__), 'resources')


def test_classify(capsys):
    with open(os.path.join(RESOURCES, 'query_text1.txt'), 'r') as f:
        text = f.read()
    classify(text)
    out, err = capsys.readouterr()
    assert 'category' in out


def test_split_labels():
    categories = {'/a/b/c': 1.0}
    split_categories = {'a': 1.0, 'b': 1.0, 'c': 1.0}
    assert split_labels(categories) == split_categories


def test_similarity():
    empty_categories = {}
    categories1 = {'/a/b/c': 1.0, '/d/e': 1.0}
    categories2 = {'/a/b': 1.0}

    assert similarity(empty_categories, categories1) == 0.0
    assert similarity(categories1, categories1) > 0.99
    assert similarity(categories1, categories2) > 0
    assert similarity(categories1, categories2) < 1
1 change: 1 addition & 0 deletions language/classify_text/requirements.txt
@@ -0,0 +1 @@
google-cloud-language==0.29.0
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text1.txt
@@ -0,0 +1 @@
Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice.
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text2.txt
@@ -0,0 +1 @@
The Hitchhiker's Guide to the Galaxy is the first of five books in the Hitchhiker's Guide to the Galaxy comedy science fiction "trilogy" by Douglas Adams (with the sixth written by Eoin Colfer).
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text3.txt
@@ -0,0 +1 @@
Goodnight Moon is an American children's picture book written by Margaret Wise Brown and illustrated by Clement Hurd. It was published on September 3, 1947, and is a highly acclaimed example of a bedtime story.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/android.txt
@@ -0,0 +1 @@
Android is a mobile operating system developed by Google, based on the Linux kernel and designed primarily for touchscreen mobile devices such as smartphones and tablets.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/cat_in_the_hat.txt
@@ -0,0 +1 @@
The Cat in the Hat is a children's book written and illustrated by Theodor Geisel under the pen name Dr. Seuss and first published in 1957. The story centers on a tall anthropomorphic cat, who wears a red and white-striped hat and a red bow tie.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/cloud_computing.txt
@@ -0,0 +1 @@
Cloud computing is a computing-infrastructure and software model for enabling ubiquitous access to shared pools of configurable resources (such as computer networks, servers, storage, applications and services), which can be rapidly provisioned with minimal management effort, often over the Internet.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/eclipse.txt
@@ -0,0 +1 @@
A solar eclipse (as seen from the planet Earth) is a type of eclipse that occurs when the Moon passes between the Sun and Earth, and when the Moon fully or partially blocks (occults) the Sun.