[cnnturk] Add new extractor #32671

erendn · 2023-12-22T09:40:10Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

CNN Turk is one of the most popular TV channels in Turkey. Their website contains recordings of their livestreams dating back as far as 2007. I wanted to add youtube-dl support for this website.

youtube_dl/extractor/cnnturk.py

dirkf

Thanks for your work! And apologies that it's been in the stack so long.

I don't think this site has a corresponding Support Request issue, so please create one, or, maybe easier, paste the completed template that you would have submitted as a comment here.

I've made some suggestions and also applied one to make the tests more compact. The main point of that was to trigger the CI tests, which seemed to have been turned off for the PR. As to the actual change, apart from layout, it's tricky to test description fields, which may be quite long. If you use the MD5 of the value, the test value becomes short but you can't tell how the value has changed if the test starts failing; so I prefer a regular expression test.
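The trade-off between the two styles of expected value can be sketched as below. This is a simplified stand-in for the comparison a test performs on `md5:` and `re:` prefixed values, not the test harness code itself; `matches_expected` and the sample description are hypothetical.

```python
import hashlib
import re


def matches_expected(expected, got):
    """Simplified stand-in for comparing an expected test value with a result."""
    if expected.startswith('re:'):
        # Regular expression: readable, and tolerant of minor text changes
        return re.match(expected[len('re:'):], got) is not None
    if expected.startswith('md5:'):
        # MD5: short, but a failure says nothing about what changed
        digest = hashlib.md5(got.encode('utf-8')).hexdigest()
        return expected == 'md5:' + digest
    return expected == got


# Hypothetical description text, for illustration only
description = 'Example programme description, edited slightly after publication.'
md5_expected = 'md5:' + hashlib.md5(description.encode('utf-8')).hexdigest()
re_expected = r're:Example programme description'

edited = description + ' (updated)'
print(matches_expected(md5_expected, edited))  # the MD5 check breaks on any edit
print(matches_expected(re_expected, edited))   # the regex check still matches
```

With a regular expression, a failing test still shows which part of the text was expected, which is the point made above.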

dirkf · 2024-02-22T16:01:47Z

youtube_dl/extractor/cnnturk.py

+    _VALID_URL = r'''(?x)
+                https?://
+                    (?:www\.)?cnnturk\.com/
+                    (?:
+                        tv-cnn-turk/programlar/|
+                        video/|
+                        turkiye/|
+                        dunya/|
+                        ekonomi/
+                    )
+                    (?:[^/]+/)*
+                    (?P<id>[^/?#&]+)
+                '''
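A quick way to sanity-check a pattern like this outside the extractor is to run it through `re` directly. The sketch below copies the pattern from the diff; the `match_id` helper and the sample URLs are hypothetical and only illustrate which part of a URL ends up in the `id` group.

```python
import re

# Pattern copied from the diff above; (?x) lets it span several lines
_VALID_URL = r'''(?x)
    https?://
        (?:www\.)?cnnturk\.com/
        (?:
            tv-cnn-turk/programlar/|
            video/|
            turkiye/|
            dunya/|
            ekonomi/
        )
        (?:[^/]+/)*
        (?P<id>[^/?#&]+)
    '''


def match_id(url):
    """Return the captured id, or None if the URL is not supported."""
    m = re.match(_VALID_URL, url)
    return m.group('id') if m else None


# Hypothetical URLs, for illustration only
print(match_id('https://www.cnnturk.com/video/turkiye/example-clip'))
print(match_id('https://www.cnnturk.com/video/example-clip?page=1'))
print(match_id('https://www.cnnturk.com/spor/example-clip'))
```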


dirkf · 2024-02-22T16:05:52Z

youtube_dl/extractor/cnnturk.py

+        video_url = video_info['MediaFiles'][0]['Path']
+        if not video_url.startswith("http"):
+            video_url = 'https://cnnvod.duhnet.tv/' + video_url
+        extension = 'mp4' if video_url.endswith('mp4') else 'm3u8'


Use utils.determine_ext():

Suggested change

-        extension = 'mp4' if video_url.endswith('mp4') else 'm3u8'
+        extension = determine_ext(video_url)
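To illustrate why looking at the URL path is more robust than `endswith('mp4')`, here is a minimal standard-library sketch. It is a stand-in and not the implementation of `determine_ext()`; the `guess_ext` helper and the URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def guess_ext(url, default=None):
    """Take the extension from the URL path, ignoring query and fragment."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lstrip('.')
    return ext or default


# Hypothetical URLs, for illustration only
with_query = 'https://cnnvod.duhnet.tv/a/b/clip.mp4?token=abc'
print(with_query.endswith('mp4'))  # the original check misses this URL
print(guess_ext(with_query))
print(guess_ext('https://cnnvod.duhnet.tv/a/playlist.m3u8'))
```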

dirkf · 2024-02-22T16:09:11Z

youtube_dl/extractor/cnnturk.py

+        if not video_url.startswith("http"):
+            video_url = 'https://cnnvod.duhnet.tv/' + video_url


Use utils.urljoin():

Suggested change

-        if not video_url.startswith("http"):
-            video_url = 'https://cnnvod.duhnet.tv/' + video_url
+        video_url = urljoin('https://cnnvod.duhnet.tv/', video_url)
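The joining behaviour can be tried out with the standard library's `urllib.parse.urljoin`, used here as a stand-in for the `utils` helper (which may differ in details such as handling of invalid input). The paths are hypothetical.

```python
from urllib.parse import urljoin

BASE = 'https://cnnvod.duhnet.tv/'

# A relative path is resolved against the base
relative = urljoin(BASE, 'media/clip.mp4')

# A leading slash does not produce a doubled '//' in the result
rooted = urljoin(BASE, '/media/clip.mp4')

# An absolute URL is returned unchanged, so no startswith() check is needed
absolute = urljoin(BASE, 'https://example.com/media/playlist.m3u8')

print(relative)
print(rooted)
print(absolute)
```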

dirkf · 2024-02-22T16:40:03Z

youtube_dl/extractor/cnnturk.py

+        # Video info is a JSON object inside a script tag
+        video_info = self._parse_json(
+            self._search_regex(
+                r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2],
+            video_id)


Maybe they'll send the JS in a different order. I think a pattern like
r'script[^>]*> [\w\s=]\(\s*function\s*\([\w\s]+\)\s*\{[^})]*\}\s*\)\s*\(\s*(\{.+?\})\s*\)[;\s]*</script' would be safer.

Once #32725 is merged, you can use _search_json(), which tries to get the JSON even if the search pattern matches extra text after the wanted JSON.
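The idea of tolerating extra text after the wanted JSON can be sketched with the standard library: `json.JSONDecoder.raw_decode()` parses one JSON value from a given offset and ignores whatever follows. This is only an illustration of the approach, not the code of `_search_json()`; the `extract_json_after` helper, the pattern, and the sample page are hypothetical.

```python
import json
import re


def extract_json_after(pattern, text):
    """Parse the JSON value that starts right where `pattern` stops matching."""
    m = re.search(pattern, text)
    if not m:
        return None
    try:
        # raw_decode() stops at the end of the first JSON value, so
        # trailing JavaScript such as ');' does not need slicing off
        obj, _end = json.JSONDecoder().raw_decode(text, m.end())
    except ValueError:
        return None
    return obj


# Hypothetical page fragment: keys in a different order than the diff expects
webpage = ('<script>(function (d) { init(d); })'
           '({"Title": "Haber", "Ancestors": [1, 2]});</script>')

# Match up to, but not including, the opening brace of the argument
video_info = extract_json_after(r'\)\s*\(\s*(?=\{)', webpage)
print(video_info)
```

Because the parser decides where the object ends, the `[:-2]` slice in the original code is no longer needed, and the key order inside the object does not matter.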

dirkf · 2024-02-22T16:53:00Z

youtube_dl/extractor/cnnturk.py

+                r'({"Ancestors":.+?\);)', webpage, 'stream')[:-2],
+            video_id)
+
+        video_url = video_info['MediaFiles'][0]['Path']


There's a media_files too, with the same list when I checked. Using utils.traverse_obj() you can get the URL from either, in case one or the other is missing:

Suggested change

-        video_url = video_info['MediaFiles'][0]['Path']
+        video_url = traverse_obj(video_info, (
+            ('MediaFiles', 'media_files'), 0, 'Path'), get_all=False)
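In plain Python, the fallback lookup that this path expresses amounts to roughly the following. This is a sketch of the intended behaviour under the assumption that both keys hold a list of dicts with a `Path` entry; `first_media_path` is a hypothetical helper, not part of `utils`.

```python
def first_media_path(video_info, keys=('MediaFiles', 'media_files')):
    """Return the first usable 'Path' under any of the candidate keys."""
    for key in keys:
        files = video_info.get(key)
        # Skip missing keys, empty lists and unexpected types instead of crashing
        if isinstance(files, list) and files and isinstance(files[0], dict):
            path = files[0].get('Path')
            if path:
                return path
    return None


# Hypothetical metadata, for illustration only
print(first_media_path({'MediaFiles': [{'Path': 'media/clip.mp4'}]}))
print(first_media_path({'media_files': [{'Path': 'media/playlist.m3u8'}]}))
print(first_media_path({}))
```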

dirkf · 2024-02-22T17:05:58Z

youtube_dl/extractor/cnnturk.py

+        formats = [{
+            'url': video_url,
+            'ext': extension,
+            'language': 'tr',
+        }]


We can extract the formats from an m3u8 manifest with this pattern:

Suggested change

-        formats = [{
-            'url': video_url,
-            'ext': extension,
-            'language': 'tr',
-        }]
+        if extension == 'm3u8':
+            formats = self._extract_m3u8_formats(
+                video_url, video_id, ext='mp4',
+                entry_protocol='m3u8_native',  # this is probably OK but needs to be tested
+                fatal=False)
+        else:
+            formats = [{
+                'url': video_url,
+                'ext': extension,
+            }]
+        for f in formats:
+            f.setdefault('language', 'tr')
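The effect of the final loop can be shown without any network access. In the sketch below, `build_formats` is a hypothetical helper and `manifest_formats` stands in for whatever list the manifest extraction returns; the sample formats are made up.

```python
def build_formats(video_url, extension, manifest_formats=None):
    """Pick manifest formats for m3u8, else a single direct format."""
    if extension == 'm3u8':
        # Stand-in for the list produced by parsing the manifest
        formats = list(manifest_formats or [])
    else:
        formats = [{'url': video_url, 'ext': extension}]
    for f in formats:
        # setdefault() only fills the language in when it is not already set
        f.setdefault('language', 'tr')
    return formats


# Hypothetical manifest result: one format already carries a language
from_manifest = build_formats('playlist.m3u8', 'm3u8', [
    {'url': 'audio_en.m3u8', 'ext': 'mp4', 'language': 'en'},
    {'url': 'video_720.m3u8', 'ext': 'mp4'},
])
direct = build_formats('clip.mp4', 'mp4')
print(from_manifest)
print(direct)
```

Using `setdefault()` rather than plain assignment keeps any language that the manifest itself declares.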

dirkf · 2024-02-22T17:14:33Z

youtube_dl/extractor/cnnturk.py

+        }]
+
+        return {
+            'id': video_id,


This should be display_id, and video_id should be the UUID that is the value of video_data['_Id'] and is also the value of the id and data-id attributes in a <div> with class player-container. Don't you think?
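As an illustration of reading the UUID from the page, the sketch below scans `<div>` tags for the `player-container` class and takes the `data-id` attribute. The markup, the UUID, and the `find_player_uuid` helper are hypothetical; the real page may order or quote its attributes differently.

```python
import re

UUID_RE = r'[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}'


def find_player_uuid(webpage):
    """Return data-id of the first <div> whose class includes player-container."""
    for tag in re.findall(r'<div\b[^>]*>', webpage):
        if not re.search(r'''\bclass=["'][^"']*\bplayer-container\b''', tag):
            continue
        m = re.search(r'''\bdata-id=["'](%s)["']''' % UUID_RE, tag)
        if m:
            return m.group(1)
    return None


# Hypothetical markup, for illustration only
webpage = ('<div class="header"></div>'
           '<div id="6587a2c4-1b2d-4e3f-9a8b-0c1d2e3f4a5b" '
           'class="player-container video" '
           'data-id="6587a2c4-1b2d-4e3f-9a8b-0c1d2e3f4a5b"></div>')

info = {
    'id': find_player_uuid(webpage),  # stable UUID
    'display_id': 'example-clip',     # slug taken from the URL
}
print(info)
```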

dirkf · 2024-02-22T17:48:09Z

youtube_dl/extractor/cnnturk.py

+            'release_date': video_info['published_date'],
+            'upload_date': video_info['created_date'],


These are optional, so the code shouldn't crash if the respective field is missing. Also, there is a time field too that can be used. Define an inline function using utils.parse_iso8601() (ISO 8601 dates are like yyyy-mm-ddThh:mm:ss, roughly):

    def get_datetime(v, when):
        dt = v.get(when + '_date', '').strip()
        if not dt.isdigit() or len(dt) < 8:  # year 10000 bug!
            return
        dt = '-'.join((dt[:4], dt[4:6], dt[6:]))
        return parse_iso8601(
            'T'.join((dt, v.get(when + '_time', '0:00:00'))))

Then:

Suggested change

-            'release_date': video_info['published_date'],
-            'upload_date': video_info['created_date'],
+            'release_date': get_datetime(video_info, 'published'),
+            'upload_date': get_datetime(video_info, 'created'),
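A self-contained variant of the helper, using only the standard library in place of `parse_iso8601()`, might look like the sketch below. It assumes the fields look like `published_date: 'YYYYMMDD'` and `published_time: 'HH:MM:SS'` and treats the values as UTC; both are assumptions that would need checking against the site's data. Unlike the helper above, it accepts only 8-digit dates.

```python
import calendar
from datetime import datetime


def get_datetime(v, when):
    """Combine <when>_date and <when>_time into a Unix timestamp, or None."""
    dt = (v.get(when + '_date') or '').strip()
    if not dt.isdigit() or len(dt) != 8:
        # Missing or malformed date: skip the field instead of crashing
        return None
    tm = (v.get(when + '_time') or '00:00:00').strip()
    try:
        parsed = datetime.strptime(dt + 'T' + tm, '%Y%m%dT%H:%M:%S')
    except ValueError:
        return None
    # Assumption: the values are UTC; the site may actually use local time
    return calendar.timegm(parsed.timetuple())


# Hypothetical metadata, for illustration only
video_info = {'published_date': '20231222', 'published_time': '09:40:10'}
print(get_datetime(video_info, 'published'))
print(get_datetime(video_info, 'created'))  # missing field gives None
```

Since the result is a Unix timestamp rather than a YYYYMMDD string, it may fit timestamp-style fields better than the date fields; that is worth checking against the field documentation.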

[cnnturk] Add new extractor

f6368b6

dirkf reviewed Feb 22, 2024


Lay out _TESTS more compactly

f295f4f

dirkf requested changes Feb 22, 2024
