-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[cnnturk] Add new extractor #32671
base: master
Are you sure you want to change the base?
[cnnturk] Add new extractor #32671
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for your work! And apologies that it's been in the stack so long.
I don't think this site has a corresponding Support Request issue, so please create one, or, maybe easier, paste the completed template that you would have submitted as a comment here.
I've made some suggestions and also applied one to make the tests more compact. The main point of that was to trigger the CI tests, which seemed to have been turned off for the PR. As to the actual change, apart from layout, it's tricky to test description
fields, which may be quite long. If you use the MD5 of the value, the test value becomes short but you can't tell how the value has changed if the test starts failing; so I prefer a regular expression test.
_VALID_URL = r'''(?x) | ||
https?:// | ||
(?:www\.)?cnnturk\.com/ | ||
(?: | ||
tv-cnn-turk/programlar/| | ||
video/| | ||
turkiye/| | ||
dunya/| | ||
ekonomi/ | ||
) | ||
(?:[^/]+/)* | ||
(?P<id>[^/?#&]+) | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferable layout:
_VALID_URL = r'''(?x) | |
https?:// | |
(?:www\.)?cnnturk\.com/ | |
(?: | |
tv-cnn-turk/programlar/| | |
video/| | |
turkiye/| | |
dunya/| | |
ekonomi/ | |
) | |
(?:[^/]+/)* | |
(?P<id>[^/?#&]+) | |
''' | |
_VALID_URL = r'''(?x) | |
https?://(?:www\.)?cnnturk\.com/ | |
(?: | |
tv-cnn-turk/programlar| | |
video|turkiye|dunya|ekonomi | |
)/ | |
(?:[^/]+/)*(?P<id>[^/?#&]+) | |
''' |
video_url = video_info['MediaFiles'][0]['Path'] | ||
if not video_url.startswith("http"): | ||
video_url = 'https://cnnvod.duhnet.tv/' + video_url | ||
extension = 'mp4' if video_url.endswith('mp4') else 'm3u8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use utils.determine_ext()
:
extension = 'mp4' if video_url.endswith('mp4') else 'm3u8' | |
extension = determine_ext(video_url) |
if not video_url.startswith("http"): | ||
video_url = 'https://cnnvod.duhnet.tv/' + video_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use utils.urljoin()
:
if not video_url.startswith("http"): | |
video_url = 'https://cnnvod.duhnet.tv/' + video_url | |
video_url = urljoin('https://cnnvod.duhnet.tv/', video_url) |
# Video info is a JSON object inside a script tag | ||
video_info = self._parse_json( | ||
self._search_regex( | ||
r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2], | ||
video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe they'll send the JS in a different order. I think a pattern like
r'script[^>]*> [\w\s=]\(\s*function\s*\([\w\s]+\)\s*\{[^})]*\}\s*\)\s*\(\s*(\{.+?\})\s*\)[;\s]*</script'
would be safer.
Once #32725 is merged, you can use _search_json()
, which tries to get the JSON even if the search pattern matches extra text after the wanted JSON.
r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2], | ||
video_id) | ||
|
||
video_url = video_info['MediaFiles'][0]['Path'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a media_files
too, with the same list when I checked. Using utils.traverse_obj()
you can get the URL from either, in case one or the other is missing:
video_url = video_info['MediaFiles'][0]['Path'] | |
video_url = traverse_obj(video_info, ( | |
('MediaFiles', 'media_files), 0, 'Path'), get_all=False) |
formats = [{ | ||
'url': video_url, | ||
'ext': extension, | ||
'language': 'tr', | ||
}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can extract the formats from a m3u8 manifest with this pattern:
formats = [{ | |
'url': video_url, | |
'ext': extension, | |
'language': 'tr', | |
}] | |
if extension = 'm3u8': | |
formats = self._extract_m3u8_formats( | |
video_url, video_id, ext='mp4', | |
entry_protocol='m3u8_native', # this is probably OK but needs to be tested | |
fatal=False) | |
else: | |
formats = [{ | |
'url': video_url, | |
'ext': extension, | |
}] | |
for f in formats: | |
f.setdefault('language', 'tr') |
}] | ||
|
||
return { | ||
'id': video_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be display_id
and video_id
should be the UUID that is value of video_data['_Id']
and is also the value of the id
and data-id
attributes in a <div>
with class player-container
. Don 't you think?
'release_date': video_info['published_date'], | ||
'upload_date': video_info['created_date'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are optional so the code shouldn't crash if the respective field is missing. Also,
there is a time field too that can be used. Define an inline function using utils.parse_iso8601()
(ISO 8601 dates are like yyyy-mm-ddThh:mm:ss, roughly):
def get_datetime(v, when):
dt = v.get(when + '_date', '').strip()
if not dt.isdigit() or len(dt) < 8: # year 10000 bug!
return
dt = '-'.join((dt[:4], dt[4:6], dt[6:]))
return parse_iso8601(
'T'.join((dt, v.get(when + '_time', '0:00:00'))))
Then:
'release_date': video_info['published_date'], | |
'upload_date': video_info['created_date'], | |
'release_date': get_datetime(video_info, 'published') | |
'upload_date': get_datetime(video_info, 'created_date') |
Please follow the guide below
x
into all the boxes [ ] relevant to your pull request (like that [x])Before submitting a pull request make sure you have:
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
What is the purpose of your pull request?
Description of your pull request and other information
CNN Turk is one of the most popular TV channels in Turkey. Their website contains recordings of their livestreams dating as early as 2007. I wanted to add youtube-dl support for this website.