Classify tutorial #1120
@@ -0,0 +1,80 @@
# Introduction

This sample contains the code referenced in the
[Text Classification Tutorial](http://cloud.google.com/natural-language/docs/classify-text-tutorial) within the Google Cloud Natural Language API documentation. A full walkthrough of this sample is located within the documentation.

This sample shows how to use the text classification feature of the Natural Language API to find similar texts based on a query.

## Prerequisites

Set up your
[Cloud Natural Language API project](https://cloud.google.com/natural-language/docs/getting-started#set_up_a_project), which includes:

* Enabling the Natural Language API
* Setting up a service account
* Setting the `GOOGLE_APPLICATION_CREDENTIALS` environment variable so that you can
  authenticate to the service

## Download the Code

```
$ git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
$ cd python-docs-samples/language/classify_text
```

## Run the Code

Open the sample folder, create a virtualenv, install dependencies, and run the sample:

```
$ virtualenv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
```

### Usage

This sample is organized as a script runnable from the command line. It can perform the following tasks:

* Classify multiple text files and write the results to an "index" file.
* Process input query text to find similar text files.
* Process an input query category label to find similar text files.

## Classify text

```
python classify_text_tutorial.py classify "$(cat resources/query_text.txt)"
```

Note that the text needs to be sufficiently long for the API to return a
non-empty response.

## Index multiple text files

```
python classify_text_tutorial.py index resources/texts
```

By default this creates a file named `index.json`; you can choose a different filename by passing the optional `--index_file` argument.
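The index file maps each indexed filename to the dictionary of category labels and confidence scores returned by the API. An entry might look like the following (the filenames, labels, and scores here are illustrative, not actual API output):

```
{
  "android.txt": {
    "/Internet & Telecom/Mobile & Wireless": 0.97
  },
  "cloud_computing.txt": {
    "/Internet & Telecom": 0.62,
    "/Computers & Electronics": 0.55
  }
}
```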
## Query with a category label

The indexed text files can be queried with any of the category labels listed on the [Categories](https://cloud.google.com/natural-language/docs/categories) page.

```
python classify_text_tutorial.py query-category index.json "/Internet & Telecom/Mobile & Wireless"
```

## Query with text

The indexed text files can be queried with another text, which need not have been indexed itself.

```
python classify_text_tutorial.py query index.json "$(cat resources/query_text1.txt)"
```
@@ -0,0 +1,251 @@
# Copyright 2017, Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START classify_text_tutorial]
"""Classifies text in a directory of files and finds similar texts.

Uses the classify_text method of the Google Cloud Natural Language API to
classify multiple text files, writes the results to an index file, and
queries the index for the files most similar to a query text or to a
category label.
"""

# [START classify_text_tutorial_import]
import argparse
import json
import os

from google.cloud import language_v1beta2
from google.cloud.language_v1beta2 import enums
from google.cloud.language_v1beta2 import types

import numpy as np
# [END classify_text_tutorial_import]


# [START def_classify]
def classify(text, verbose=True):
    """Classify the input text into categories."""
    language_client = language_v1beta2.LanguageServiceClient()

    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)
    response = language_client.classify_text(document)
    categories = response.categories

    result = {}

    for category in categories:
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence

    if verbose:
        print(text)
        for category in categories:
            print(u'=' * 20)
            print(u'{:<16}: {}'.format('category', category.name))
            print(u'{:<16}: {}'.format('confidence', category.confidence))

    return result
# [END def_classify]


# [START def_index]
def index(path, index_file):
    """Classify each text file in the directory and write
    the results to the index_file.
    """
    result = {}
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)

        if not os.path.isfile(file_path):
            continue

        try:
            with open(file_path, 'r') as f:
                text = f.read()
                categories = classify(text, verbose=False)

                result[filename] = categories
        except Exception:
            print('Failed to process {}'.format(file_path))

    with open(index_file, 'w') as f:
        json.dump(result, f)

    print('Texts indexed in file: {}'.format(index_file))
    return result
# [END def_index]


# [START def_split_labels]
def split_labels(categories):
    """The category labels are of the form "/a/b/c" up to three levels,
    for example "/Computers & Electronics/Software", and these labels
    are used as keys in the categories dictionary, whose values are
    confidence scores.

    The split_labels function splits the keys into individual levels
    while duplicating the confidence score, which allows a natural
    boost in how we calculate similarity when more levels are in common.

    Example:
    If we have

        x = {"/a/b/c": 0.5}
        y = {"/a/b": 0.5}
        z = {"/a": 0.5}

    then x and y are considered more similar than y and z.
    """
    _categories = {}
    for name, confidence in categories.items():
        labels = [label for label in name.split('/') if label]
        for label in labels:
            _categories[label] = confidence

    return _categories
# [END def_split_labels]


# [START def_similarity]
def similarity(categories1, categories2):
    """Cosine similarity of the categories treated as sparse vectors."""
    categories1 = split_labels(categories1)
    categories2 = split_labels(categories2)

    norm1 = np.linalg.norm(list(categories1.values()))
    norm2 = np.linalg.norm(list(categories2.values()))

    # Return the smallest possible similarity if either categories is empty.
    if norm1 == 0 or norm2 == 0:
        return 0.0

    # Compute the cosine similarity.
    dot = 0.0
    for label, confidence in categories1.items():
        dot += confidence * categories2.get(label, 0.0)

    return dot / (norm1 * norm2)
# [END def_similarity]


# [START def_query]
def query(index_file, text, n_top=3):
    """Find the indexed files that are the most similar to
    the query text.
    """
    with open(index_file, 'r') as f:
        index = json.load(f)

    # Get the categories of the query text.
    query_categories = classify(text, verbose=False)

    similarities = []
    for filename, categories in index.items():
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(text))
    for category, confidence in query_categories.items():
        print('\tCategory: {}, confidence: {}'.format(category, confidence))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query]


# [START def_query_category]
def query_category(index_file, category_string, n_top=3):
    """Find the indexed files that are the most similar to
    the query label.

    The list of all available labels:
    https://cloud.google.com/natural-language/docs/categories
    """
    with open(index_file, 'r') as f:
        index = json.load(f)

    # Make the category_string into a dictionary so that it is
    # of the same format as what we get by calling classify.
    query_categories = {category_string: 1.0}

    similarities = []
    for filename, categories in index.items():
        similarities.append(
            (filename, similarity(query_categories, categories)))

    similarities = sorted(similarities, key=lambda p: p[1], reverse=True)

    print('=' * 20)
    print('Query: {}\n'.format(category_string))
    print('\nMost similar {} indexed texts:'.format(n_top))
    for filename, sim in similarities[:n_top]:
        print('\tFilename: {}'.format(filename))
        print('\tSimilarity: {}'.format(sim))
    print('\n')

    return similarities
# [END def_query_category]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    subparsers = parser.add_subparsers(dest='command')
    classify_parser = subparsers.add_parser(
        'classify', help=classify.__doc__)
    classify_parser.add_argument(
        'text', help='The text to be classified. '
        'The text needs to have at least 20 tokens.')
    index_parser = subparsers.add_parser(
        'index', help=index.__doc__)
    index_parser.add_argument(
        'path', help='The directory that contains '
        'text files to be indexed.')
    index_parser.add_argument(
        '--index_file', help='Filename for the output JSON.',
        default='index.json')
    query_parser = subparsers.add_parser(
        'query', help=query.__doc__)
    query_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_parser.add_argument(
        'text', help='Query text.')
    query_category_parser = subparsers.add_parser(
        'query-category', help=query_category.__doc__)
    query_category_parser.add_argument(
        'index_file', help='Path to the index JSON file.')
    query_category_parser.add_argument(
        'category', help='Query category.')

    args = parser.parse_args()

    if args.command == 'classify':
        classify(args.text)
    if args.command == 'index':
        index(args.path, args.index_file)
    if args.command == 'query':
        query(args.index_file, args.text)
    if args.command == 'query-category':
        query_category(args.index_file, args.category)
# [END classify_text_tutorial]
@@ -0,0 +1,45 @@
# Copyright 2016, Google, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from classify_text_tutorial import classify
from classify_text_tutorial import similarity
from classify_text_tutorial import split_labels

RESOURCES = os.path.join(os.path.dirname(__file__), 'resources')


def test_classify(capsys):
    with open(os.path.join(RESOURCES, 'query_text1.txt'), 'r') as f:
        text = f.read()
    classify(text)
    out, err = capsys.readouterr()
    assert 'category' in out


def test_split_labels():
    categories = {'/a/b/c': 1.0}
    split_categories = {'a': 1.0, 'b': 1.0, 'c': 1.0}
    assert split_labels(categories) == split_categories


def test_similarity():
    empty_categories = {}
    categories1 = {'/a/b/c': 1.0, '/d/e': 1.0}
    categories2 = {'/a/b': 1.0}

    assert similarity(empty_categories, categories1) == 0.0
    assert similarity(categories1, categories1) > 0.99
    assert similarity(categories1, categories2) > 0
    assert similarity(categories1, categories2) < 1
@@ -0,0 +1 @@
google-cloud-language==0.29.0 |
@@ -0,0 +1 @@
Google Home enables users to speak voice commands to interact with services through the Home's intelligent personal assistant called Google Assistant. A large number of services, both in-house and third-party, are integrated, allowing users to listen to music, look at videos or photos, or receive news updates entirely by voice. |
@@ -0,0 +1 @@
The Hitchhiker's Guide to the Galaxy is the first of five books in the Hitchhiker's Guide to the Galaxy comedy science fiction "trilogy" by Douglas Adams (with the sixth written by Eoin Colfer). |
@@ -0,0 +1 @@
Goodnight Moon is an American children's picture book written by Margaret Wise Brown and illustrated by Clement Hurd. It was published on September 3, 1947, and is a highly acclaimed example of a bedtime story. |
@@ -0,0 +1 @@
Android is a mobile operating system developed by Google, based on the Linux kernel and designed primarily for touchscreen mobile devices such as smartphones and tablets. |
@@ -0,0 +1 @@
The Cat in the Hat is a children's book written and illustrated by Theodor Geisel under the pen name Dr. Seuss and first published in 1957. The story centers on a tall anthropomorphic cat, who wears a red and white-striped hat and a red bow tie. |
@@ -0,0 +1 @@
Cloud computing is a computing-infrastructure and software model for enabling ubiquitous access to shared pools of configurable resources (such as computer networks, servers, storage, applications and services), which can be rapidly provisioned with minimal management effort, often over the Internet. |
@@ -0,0 +1 @@
A solar eclipse (as seen from the planet Earth) is a type of eclipse that occurs when the Moon passes between the Sun and Earth, and when the Moon fully or partially blocks (occults) the Sun. |
Review comment: This should (ideally) use an autogenerated readme. This readme has way too much content and possibly duplicates the tutorial on devsite.

Reply: done.