Face detection beta features #1414

Merged (18 commits) on Mar 27, 2018

Conversation

@dizcology (Member) commented Mar 19, 2018

This PR adds samples for the following beta features:

  • face bounding boxes
  • face emotions
  • speech transcription

@googlebot added the "cla: yes" label (This human has signed the Contributor License Agreement) on Mar 19, 2018
@dizcology (Member Author):
tested against the released client library 1.1.0.

emotions = frame.attributes[0].emotions

# from videointelligence.enums
emotion_labels = (
Member:

Is there any way to use the videointelligence.enums object directly or pull this programmatically? How badly will this break if the enum changes in the future?

Member Author:

The simplest way is for the enums in videointelligence.enums to extend Python's Enum class, which would allow us to map the enum values back to their names. However, the client libraries do not do that (yet).

I don't know of a clean way to pull the names programmatically, and as it stands now the sample code will break in these possible ways:

  1. The order of the enums changes (unlikely), in which case the printout of the sample will be incorrect.

  2. More enums are added (possible) and the data we use for testing receives those added enums in the API responses, in which case we would get an IndexError in our testing.

  3. (Worst) More enums are added, but the testing data does not receive them, so we don't notice it is broken. Users relying on the same tuple to map enums to names might get an IndexError with their data.

The best long-term solution is simply to extend Enum. When that happens, we will need to come back and update these samples.
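For illustration, a minimal sketch of the Enum-based fix described above, assuming a hypothetical hand-written IntEnum (the generated videointelligence.enums classes do not do this today, and the names and values below are illustrative only):

import enum

# Hypothetical sketch: if the generated enums extended IntEnum, mapping a
# value back to its display name would come for free via .name.
class Emotion(enum.IntEnum):
    EMOTION_UNSPECIFIED = 0
    AMUSEMENT = 1          # illustrative ordering, not the real proto values
    CONTENTMENT = 2
    INTEREST = 3

print(Emotion(1).name)     # prints 'AMUSEMENT'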

Contributor:

+1 on doing this programmatically. This keeps showing up in Python samples and, as a user, it seems really frustrating that Python makes you do this... you get the proto enum's integer value via the library, but not the enum value's name?

I don't think any of the other languages have this problem?

(Make this better! 😭)

Member Author:

Without fixing the client library generation tool, this is the closest I could come up with:

from google.cloud import videointelligence_v1p1beta1 as videointelligence

emotion_label = {
    value: name
    for name, value in vars(videointelligence.enums.Emotion).iteritems()
    if not name.startswith('__')}

I would rather not have this as is in a code snippet - but perhaps abstracted away as a helper function? The reader of the sample would not be able to see the full list of enums this way.
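A sketch of what such a helper could look like, building on the snippet above (the function name emotion_name is hypothetical, not part of the sample or the library):

from google.cloud import videointelligence_v1p1beta1 as videointelligence

def emotion_name(value):
    """Map an Emotion enum value back to its name.

    A sketch of the helper mentioned above; it relies on the enum being a
    plain class whose public attributes are the integer values.
    """
    names = {v: n for n, v in vars(videointelligence.enums.Emotion).items()
             if not n.startswith('__')}
    return names.get(value, 'UNKNOWN')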

Contributor:

To be clear, I wasn't thinking about fixing this in the code sample, but rather in google-cloud-core.

IMO the code sample should use an inline, literal list with the values. The display name of the enum value should be printed by looking it up by index, as we do today.
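For illustration, a sketch of the inline-literal approach being suggested (the tuple below is abbreviated and its ordering is illustrative; the real sample spells out the full list from videointelligence.enums.Emotion, and frame is assumed from the snippet quoted earlier):

# Abbreviated, illustrative mapping from enum value to display name.
# A complete tuple is needed in practice to avoid IndexError.
emotion_labels = (
    'EMOTION_UNSPECIFIED', 'AMUSEMENT', 'CONTENTMENT', 'INTEREST',
)

emotions = frame.attributes[0].emotions
for em in emotions:
    # em.emotion is the proto enum's integer value; use it as an index.
    print('{}: {:.3f}'.format(emotion_labels[em.emotion], em.score))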

Contributor:

I'll move this to a thread on google-cloud-core 😄

Contributor:

Code in this PR regarding Enum LGTM

@nnegrey (Contributor) left a comment:

Looks pretty good, mostly stuff for discussion


Usage Examples:
python beta_snippets.py boxes \
gs://python-docs-samples-tests/video/googlework_short.mp4
Contributor:

Should we make another demo project to hold public files that can be used across languages?

# limitations under the License.

"""This application demonstrates face detection, label detection,
explicit content, and shot change detection using the Google Cloud API.
Contributor:

... face detection, face emotions, video transcription

Member Author:

done.

# include_bounding_boxes must be set to True for include_emotions
# to work.
config = videointelligence.types.FaceConfig(
    include_bounding_boxes=True,
Contributor:

Is this still required? I thought we had them pull that?

Member Author:

done.



# [START video_face_bounding_boxes]
def face_bounding_boxes(path):
Contributor:

Can they use a local file too or just a GCS file?

Contributor:

Do we only need to provide local file snippets, no GCS snippets?

Member Author:

Here we are showing only GCS snippets.

Contributor:

OH, do Python samples all use path for GCS? I figured that was for a local file

It should be uri, based on existing samples, e.g. https://cloud.google.com/vision/docs/detecting-labels#vision-label-detection-python

def detect_labels(path):
    """Detects labels in the file."""
    client = vision.ImageAnnotatorClient()
def detect_labels_uri(uri):
    """Detects labels in the file located in Google Cloud Storage or on the
    Web."""
    client = vision.ImageAnnotatorClient()

Contributor:

Let's make this change across the board to all 3 samples

Member Author:

You are right - it should be uri, or rather gcs_uri (since currently the API only supports GCS URIs). Fixed.

print('\tSegment: {}\n'.format(positions))

# There are typically many frames for each face,
# here we print information on only the first frame.
Contributor:

This comment confused me a little, especially with the possibility of segments above.
"There are typically many frames for each face, since a face is in a segment of time" (I don't know if that helps clarify it or just makes it more confusing)

Could a face appear in multiple segments?

Like:
Face 1: (Segment 0:00 --> 0:05, 0:15 --> 0:25) then the face would be in all the frames between those segments.

Member Author:

I slightly reworded this comment; please take a look.

For face detection, each face indeed is a segment. If the same person appears in two segments of the same video, two separate faces will be returned, each with its own segment.
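As a sketch of the structure described above (field names are taken from the surrounding snippets and assumed to match the v1p1beta1 response shape; treat this as an illustration rather than the merged sample):

faces = result.annotation_results[0].face_detection_annotations
for i, face in enumerate(faces):
    # Each returned face corresponds to one contiguous appearance; the same
    # person showing up twice yields two separate faces, each with its own
    # segment.
    for segment in face.segments:
        start = (segment.segment.start_time_offset.seconds +
                 segment.segment.start_time_offset.nanos / 1e9)
        end = (segment.segment.end_time_offset.seconds +
               segment.segment.end_time_offset.nanos / 1e9)
        print('Face {} segment: {}s to {}s'.format(i, start, end))

    # A face typically spans many frames; only the first is inspected here.
    frame = face.frames[0]
    print('First frame at {:.2f}s'.format(
        frame.time_offset.seconds + frame.time_offset.nanos / 1e9))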

for time_offset, emotion_label, score in frame_emotions:
    print('\t{:04.2f}s: {:14}({:4.3f})'.format(
        time_offset, emotion_label, score))
print('\n')
Contributor:

Should we merely print out the results without sorting based on score? That would help simplify the code. If we were going across the whole segment to determine the most likely emotion across all frames, I think that would be cool, but it's not as simple.

Also, by printing out the time_offset, isn't it always the same, since we only get the one frame for each face?

Member Author:

Here we are actually sorting by time_offset - an earlier version of the API did not sort the frames by it. Emotion is a frame-level output, since detecting the change of emotions is useful.

The output might look like this:

	30.56s: INTEREST      (0.606)
	30.60s: INTEREST      (0.582)
	30.63s: INTEREST      (0.561)
	30.66s: AMUSEMENT     (0.559)
	30.70s: AMUSEMENT     (0.556)
	30.73s: AMUSEMENT     (0.549)
	30.76s: CONTENTMENT   (0.564)
	30.80s: CONTENTMENT   (0.634)
	30.83s: CONTENTMENT   (0.688)
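For illustration, the defensive sort described above might look like this (a sketch; face.frames and time_offset are assumed from the surrounding snippets):

# Sort the frames by time offset, since an earlier API version did not
# guarantee chronological order.
frames = sorted(
    face.frames,
    key=lambda f: f.time_offset.seconds + f.time_offset.nanos / 1e9)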


features = [
    videointelligence.enums.Feature.SPEECH_TRANSCRIPTION
]
Contributor:

One line?

Member Author:

done.


result = operation.result(timeout=180)

annotation_results = result.annotation_results[0]
Contributor:

Can you add a comment that notes you are only pulling out the first result, or is there only one result?

Member Author:

done.
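For reference, a sketch of what such a comment might look like (the wording is assumed, not quoted from the merged sample):

# There is only one annotation_result, since a single video was processed.
annotation_results = result.annotation_results[0]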

@nnegrey (Contributor) left a comment:

Looks great (1 tiny fix)

positions = '{}s to {}s'.format(start_time, end_time)
print('\tSegment: {}\n'.format(positions))

# Each detected may appear in many frames of the video.
Contributor:

Each detected face
(I like the rewording)


emotion, score = sorted(
    [(em.emotion, em.score) for em in emotions],
    key=lambda p: p[1])[-1]
Contributor:

Non-trivial logic alert!

Member Author:

@nnegrey I think I misunderstood your comment about this part - indeed I was sorting by emotion scores, the reason being that the API returns one score for each emotion, and here I am trying to show only the one that scores highest.

Member Author:

Added a comment here to clarify what I was doing; please take a look.

Contributor:

This line is really impressive:

  • multiple return values
  • sorting
  • list comprehension
  • lambda
  • -1 index used


# every emotion gets a score, here we sort them by
# scores and keep only the one that scores the highest.
most_likely_emotion = sorted(emotions, key=lambda em: em.score)[-1]
Contributor:

Still somewhat impressive!

  • sorting
  • lambda expression
  • -1

Member Author:

Simplified. Thanks for taking the time to explain the issues to me!
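For readers who find the sort-and-index form opaque, an equivalent that avoids it entirely (a sketch, not necessarily the code merged in this PR):

# max() with a key function picks the highest-scoring emotion directly,
# without building and sorting an intermediate list.
most_likely_emotion = max(emotions, key=lambda em: em.score)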



# [START video_speech_transcription]
def speech_transcription(input_uri):
Contributor:

The other 2 functions use gcs_uri but this one uses input_uri.

We should make them consistent (and I'd take a look at whatever variable name we use in all of the other Vision samples and use that for consistency).

@andrewsg merged commit 93124e0 into GoogleCloudPlatform:master on Mar 27, 2018
busunkim96 pushed a commit to busunkim96/python-videointelligence that referenced this pull request May 20, 2020
danoscarmike pushed a commit to googleapis/python-videointelligence that referenced this pull request Sep 30, 2020
parthea pushed a commit to googleapis/google-cloud-python that referenced this pull request Sep 22, 2023