Classify tutorial #1120

Merged 11 commits on Sep 19, 2017

Changes from 5 commits
80 changes: 80 additions & 0 deletions language/classify_text/README.md
@@ -0,0 +1,80 @@
# Introduction

Contributor: This should (ideally) use an autogenerated readme. This readme has way too much content and possibly duplicates the tutorial on devsite.

Member Author: done.

This sample contains the code referenced in the
[Text Classification Tutorial](http://cloud.google.com/natural-language/docs/classify-text-tutorial) in the Google Cloud Natural Language API documentation. A full walkthrough of this sample is available in that documentation.

This sample shows how to use the text classification feature of the Natural Language API to find texts that are similar to a given query.

## Prerequisites

Set up your
[Cloud Natural Language API project](https://cloud.google.com/natural-language/docs/getting-started#set_up_a_project),
which includes:

* Enabling the Natural Language API
* Setting up a service account
* Setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable so that the sample can authenticate to the service (see the example below)
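
For example, on Linux or macOS you can point the variable at your downloaded service account key (the path below is a placeholder):

```
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json
```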

## Download the Code

```
$ git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
$ cd python-docs-samples/language/classify_text
```

## Run the Code

Create a virtualenv, activate it, and install the dependencies:

```
$ virtualenv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
```

### Usage

This sample is organized as a script runnable from the command line. It can perform the following tasks:

* Classifies multiple text files and writes the results to an "index" file.
* Processes an input query text to find similar text files.
* Processes an input query category label to find similar text files.

## Classify text

```
python classify_text_tutorial.py classify "$(cat resources/query_text.txt)"
```

Note that the text needs to be sufficiently long (at least 20 tokens) for the API to return a non-empty response.
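
For example, you can try classifying one of the sample texts bundled under `resources/texts` (a usage sketch):

```
python classify_text_tutorial.py classify "$(cat resources/texts/android.txt)"
```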

## Index multiple text files

```
python classify_text_tutorial.py index resources/texts
```

By default this writes the index to a file named `index.json`; you can choose a different filename with the optional `--index_file` argument.
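
For example, to write the index to a different file (a sketch; `my_index.json` is an arbitrary name), pass the `--index_file` flag:

```
python classify_text_tutorial.py index resources/texts --index_file my_index.json
```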

## Query with a category label

The indexed text files can be queried with any of the category labels listed on the [Categories](https://cloud.google.com/natural-language/docs/categories) page.

```
python classify_text_tutorial.py query-category index.json "/Internet & Telecom/Mobile & Wireless"
```

## Query with text

The indexed text files can also be queried with new text that has not itself been indexed.

```
python classify_text_tutorial.py query index.json "$(cat resources/query_text1.txt)"
```





251 changes: 251 additions & 0 deletions language/classify_text/classify_text_tutorial.py
@@ -0,0 +1,251 @@
# Copyright 2017, Google, Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START classify_text_tutorial]
"""Using the classify_text method to cluster texts."""

Contributor: This docstring needs to be far more descriptive.

Member Author: done. but let me know if I should add more. the tutorial page on cloud.google.com hasn't been published yet.

# [START classify_text_tutorial_import]
import argparse
import json
import os

from google.cloud import language_v1beta2
from google.cloud.language_v1beta2 import enums

Contributor: you can just use language_v1beta2.types and language_v1beta2.enums if you want to save yourself the trouble of importing.

Member Author: In order to reduce clutter in other samples (e.g. prefer types.Document over language_v1beta2.Document) I have been importing all three separately. I hope to keep it consistent in this code sample as well.

from google.cloud.language_v1beta2 import types

import numpy as np

Contributor: please do not alias imports. This also goes into the second section of imports.

Member Author: Done.

However, aliasing numpy as np seems very common for Python. Later let's revisit the possibility of relaxing the authoring guide to allow this?

# [END classify_text_tutorial_import]


# [START def_classify]
def classify(text, verbose=True):
    """Classify the input text into categories."""

    language_client = language_v1beta2.LanguageServiceClient()

    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)
    categories = language_client.classify_text(document).categories

Contributor: For better understandability, please assign the result to a temporary variable:

    result = language_client.classify_text(document)
    categories = result.categories

Member Author: done.

    result = {}

    for category in categories:
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence

    if verbose:
        print(text)
        for category in categories:
            print(u'=' * 20)
            print(u'{:<16}: {}'.format('category', category.name))
            print(u'{:<16}: {}'.format('confidence', category.confidence))

    return result
# [END def_classify]


# [START def_index]
def index(path, index_file):
    """Classify each text file in the directory and write
    the results to the index_file.
    """

    result = {}
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if not os.path.isfile(file_path):
            continue

        try:
            with open(file_path, 'r') as f:
                text = f.read()
                categories = classify(text, verbose=False)

                result[filename] = categories
        except Exception:
            print('Failed to process {}'.format(file_path))

    with open(index_file, 'w') as f:
        json.dump(result, f)

    print('Texts indexed in file: {}'.format(index_file))
    return result
# [END def_index]


# [START def_split_labels]
def split_labels(categories):
    """The category labels are of the form "/a/b/c" up to three levels,
    for example "/Computers & Electronics/Software", and these labels
    are used as keys in the categories dictionary, whose values are
    confidence scores.

    The split_labels function splits the keys into individual levels
    while duplicating the confidence score, which allows a natural
    boost in how we calculate similarity when more levels are in common.

    Example:
        If we have

        x = {"/a/b/c": 0.5}
        y = {"/a/b": 0.5}
        z = {"/a": 0.5}

        Then x and y are considered more similar than y and z.
    """
    _categories = {}
    for name, confidence in categories.iteritems():
        labels = [label for label in name.split('/') if label]
        for label in labels:
            _categories[label] = confidence

    return _categories
# [END def_split_labels]
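# A quick illustration of the docstring above (an illustrative sketch):
#   split_labels({'/Computers & Electronics/Software': 0.6})
#   returns {'Computers & Electronics': 0.6, 'Software': 0.6},
# so texts that share more label levels accumulate a larger dot product
# in the similarity computation below.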


# [START def_similarity]
def similarity(categories1, categories2):
    """Cosine similarity of the categories treated as sparse vectors."""
    categories1 = split_labels(categories1)
    categories2 = split_labels(categories2)

    norm1 = np.linalg.norm(categories1.values())
    norm2 = np.linalg.norm(categories2.values())

    # Return the smallest possible similarity if either categories is empty.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    # Compute the cosine similarity.
    dot = 0.0
    for label, confidence in categories1.iteritems():
        dot += confidence * categories2.get(label, 0.0)

    return dot / (norm1 * norm2)
# [END def_similarity]
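# Worked example (a sketch): for categories1 = {'/a/b': 1.0} and
# categories2 = {'/a/b/c': 1.0}, split_labels yields {'a': 1.0, 'b': 1.0}
# and {'a': 1.0, 'b': 1.0, 'c': 1.0}. The dot product is 2.0 and the norms
# are sqrt(2) and sqrt(3), so the similarity is 2 / sqrt(6), roughly 0.816.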


# [START def_query]
def query(index_file, text, n_top=3):
    """Find the indexed files that are the most similar to
    the query text.
    """

    with open(index_file, 'r') as f:

Contributor: Throughout this file, please use io.open over open.

Member Author: done.

        index = json.load(f)

    # Get the categories of the query text.
    query_categories = classify(text, verbose=False)

    similarities = []
    for filename, categories in index.iteritems():

Contributor: You'll need to use six.iteritems(index) for this to work on 2 & 3.

Member Author: done.

        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(text))
    for category, confidence in query_categories.iteritems():
        print('\tCategory: {}, confidence: {}'.format(category, confidence))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query]
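# Sketch of the portability changes requested in the review above (not yet
# applied in this revision of the diff): read the index with io.open() and
# iterate with six.iteritems() so the loop works on both Python 2 and 3:
#
#     import io
#     import six
#
#     with io.open(index_file, 'r') as f:
#         index = json.load(f)
#     for filename, categories in six.iteritems(index):
#         ...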


# [START def_query_category]
def query_category(index_file, category_string, n_top=3):
    """Find the indexed files that are the most similar to
    the query label.

    The list of all available labels:
    https://cloud.google.com/natural-language/docs/categories
    """

    with open(index_file, 'r') as f:
        index = json.load(f)

    # Make the category_string into a dictionary so that it is
    # of the same format as what we get by calling classify.
    query_categories = {category_string: 1.0}

    similarities = []
    for filename, categories in index.iteritems():
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(category_string))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query_category]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    subparsers = parser.add_subparsers(dest='command')
    classify_parser = subparsers.add_parser(
        'classify', help=classify.__doc__)
    classify_parser.add_argument(
        'text', help='The text to be classified. '
                     'The text needs to have at least 20 tokens.')
    index_parser = subparsers.add_parser(
        'index', help=index.__doc__)
    index_parser.add_argument(
        'path', help='The directory that contains '
                     'text files to be indexed.')
    index_parser.add_argument(
        '--index_file', help='Filename for the output JSON.',
        default='index.json')
    query_parser = subparsers.add_parser(
        'query', help=query.__doc__)
    query_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_parser.add_argument(
        'text', help='Query text.')
    query_category_parser = subparsers.add_parser(
        'query-category', help=query_category.__doc__)
    query_category_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_category_parser.add_argument(
        'category', help='Query category.')

    args = parser.parse_args()

    if args.command == 'classify':
        classify(args.text)
    if args.command == 'index':
        index(args.path, args.index_file)
    if args.command == 'query':
        query(args.index_file, args.text)
    if args.command == 'query-category':
        query_category(args.index_file, args.category)
# [END classify_text_tutorial]
45 changes: 45 additions & 0 deletions language/classify_text/classify_text_tutorial_test.py
@@ -0,0 +1,45 @@
# Copyright 2016, Google, Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from classify_text_tutorial import classify
from classify_text_tutorial import similarity
from classify_text_tutorial import split_labels

RESOURCES = os.path.join(os.path.dirname(__file__), 'resources')


def test_classify(capsys):
    with open(os.path.join(RESOURCES, 'query_text1.txt'), 'r') as f:
        text = f.read()
    classify(text)
    out, err = capsys.readouterr()
    assert 'category' in out


def test_split_labels():
    categories = {'/a/b/c': 1.0}
    split_categories = {'a': 1.0, 'b': 1.0, 'c': 1.0}
    assert split_labels(categories) == split_categories


def test_similarity():
    empty_categories = {}
    categories1 = {'/a/b/c': 1.0, '/d/e': 1.0}
    categories2 = {'/a/b': 1.0}

    assert similarity(empty_categories, categories1) == 0.0
    assert similarity(categories1, categories1) > 0.99
    assert similarity(categories1, categories2) > 0
    assert similarity(categories1, categories2) < 1
1 change: 1 addition & 0 deletions language/classify_text/requirements.txt
@@ -0,0 +1 @@
google-cloud-language==0.29.0
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text1.txt
@@ -0,0 +1 @@
Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice.
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text2.txt
@@ -0,0 +1 @@
The Hitchhiker's Guide to the Galaxy is the first of five books in the Hitchhiker's Guide to the Galaxy comedy science fiction "trilogy" by Douglas Adams (with the sixth written by Eoin Colfer).
1 change: 1 addition & 0 deletions language/classify_text/resources/query_text3.txt
@@ -0,0 +1 @@
Goodnight Moon is an American children's picture book written by Margaret Wise Brown and illustrated by Clement Hurd. It was published on September 3, 1947, and is a highly acclaimed example of a bedtime story.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/android.txt
@@ -0,0 +1 @@
Android is a mobile operating system developed by Google, based on the Linux kernel and designed primarily for touchscreen mobile devices such as smartphones and tablets.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/cat_in_the_hat.txt
@@ -0,0 +1 @@
The Cat in the Hat is a children's book written and illustrated by Theodor Geisel under the pen name Dr. Seuss and first published in 1957. The story centers on a tall anthropomorphic cat, who wears a red and white-striped hat and a red bow tie.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/cloud_computing.txt
@@ -0,0 +1 @@
Cloud computing is a computing-infrastructure and software model for enabling ubiquitous access to shared pools of configurable resources (such as computer networks, servers, storage, applications and services), which can be rapidly provisioned with minimal management effort, often over the Internet.
1 change: 1 addition & 0 deletions language/classify_text/resources/texts/eclipse.txt
@@ -0,0 +1 @@
A solar eclipse (as seen from the planet Earth) is a type of eclipse that occurs when the Moon passes between the Sun and Earth, and when the Moon fully or partially blocks (occults) the Sun.