Filtering based on meta data for each youtube URL #17

Open
6 tasks
daniel-z-kaplan opened this issue Aug 9, 2023 · 2 comments

daniel-z-kaplan commented Aug 9, 2023

  • Inappropriate content: NSFW, hate speech, offensive words, sentiments
      • Title, captions, tags, comments
  • Quality data:
      • Length, aspect ratio, resolution, likes, views, number of subscribers of the creator, comments filtered with a language model, ensuring some amount of movement in the video
  • Language: English for now

(From Michael) the final version of the dataset from the collection side of things will look like the following:

  • channels.tsv with columns ['link', 'name', 'description', 'subscribers', 'isFamilySafe', 'tags']
  • videos.tsv with columns ['channel_link', 'id', 'title', 'date', 'length', 'views']
  • a sample with 2 videos from a random set of channels listed in channels.tsv can be found in the DuckAI Google Drive (note that there may be some duplicates!)
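
As a starting point for working with these two files, here is a minimal sketch that parses the TSVs, drops duplicate video ids, and attaches channel columns to each video. It assumes both files have a header row matching the column lists above and that `channel_link` in videos.tsv matches `link` in channels.tsv; the sample rows and the helper names (`read_tsv`, `dedupe_videos`, `attach_channels`) are made up for illustration.

```python
import csv
import io

def read_tsv(text):
    """Parse TSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def dedupe_videos(videos):
    """Drop repeated video ids, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in videos:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique

def attach_channels(videos, channels):
    """Add channel-level columns to each video (assumes channel_link == link)."""
    by_link = {c["link"]: c for c in channels}
    joined = []
    for v in videos:
        channel = by_link.get(v["channel_link"])
        if channel is None:
            continue  # skip videos whose channel is not listed
        joined.append({
            **v,
            "channel_name": channel["name"],
            "subscribers": channel["subscribers"],
            "isFamilySafe": channel["isFamilySafe"],
        })
    return joined

# Made-up rows, only to show the shape of the data.
CHANNELS = (
    "link\tname\tdescription\tsubscribers\tisFamilySafe\ttags\n"
    "/c/chan-a\tChan A\tCooking videos\t12000\tTrue\tfood,cooking\n"
)
VIDEOS = (
    "channel_link\tid\ttitle\tdate\tlength\tviews\n"
    "/c/chan-a\ta1\tPasta basics\t2023-01-01\t300\t5000\n"
    "/c/chan-a\ta1\tPasta basics\t2023-01-01\t300\t5000\n"
    "/c/chan-a\tb2\tKnife skills\t2023-02-01\t240\t800\n"
)

videos = dedupe_videos(read_tsv(VIDEOS))
joined = attach_channels(videos, read_tsv(CHANNELS))
```

For the real files, the same functions can be fed the contents of `channels.tsv` and `videos.tsv` read from disk.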

Loose tasks:

  • Data analysis on metadata in the TSVs
  • Build pipeline to collect metadata using video2dataset (may need to alter video2dataset)
  • Analysis on NSFW (using some kind of NSFW filter)
  • English filter
  • Pixel-based filters using thumbnails (see comments below on YouTube's auto-generated thumbnails)
  • Brainstorm more ideas on filtering to have high quality video!
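
A first pass at the metadata side of these tasks could be a simple threshold filter over joined video/channel rows. The sketch below is only an illustration: the threshold values are placeholders, `length` is assumed to be in seconds, `isFamilySafe` is assumed to be stored as the string "True" or "False", and `passes_metadata_filter` is a hypothetical helper name.

```python
def passes_metadata_filter(row, min_seconds=10, max_seconds=600,
                           min_views=100, min_subscribers=1000):
    """Return True if a joined video/channel row clears every threshold.

    All thresholds are placeholders to be tuned after the data analysis.
    """
    # Assumption: isFamilySafe is the string "True" or "False".
    if str(row["isFamilySafe"]).strip().lower() != "true":
        return False
    # Assumption: length is a duration in seconds.
    length = int(row["length"])
    if not (min_seconds <= length <= max_seconds):
        return False
    if int(row["views"]) < min_views:
        return False
    return int(row["subscribers"]) >= min_subscribers

# Made-up rows, only to exercise each rule once.
SAMPLE = [
    {"id": "ok", "length": "120", "views": "5000",
     "subscribers": "20000", "isFamilySafe": "True"},
    {"id": "short", "length": "3", "views": "5000",
     "subscribers": "20000", "isFamilySafe": "True"},
    {"id": "unsafe", "length": "120", "views": "5000",
     "subscribers": "20000", "isFamilySafe": "False"},
    {"id": "tiny", "length": "120", "views": "5",
     "subscribers": "20000", "isFamilySafe": "True"},
]

kept = [r["id"] for r in SAMPLE if passes_metadata_filter(r)]
```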
@daniel-z-kaplan daniel-z-kaplan added the data All things data label Aug 9, 2023

mtanghu commented Aug 10, 2023

We could use VidGear/CamGear to do frame-based filtering/streaming (it allows reading in the frames of a video by streaming, without needing to download the video)
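
A rough sketch of what that could look like, assuming `vidgear` (and its `yt_dlp` dependency for stream mode) is installed. `stream_frames` and `every_nth` are hypothetical helper names; the sampling step and limit are placeholders.

```python
from itertools import islice

def stream_frames(url):
    """Yield frames (NumPy arrays) from a YouTube URL without saving the video."""
    # Imported lazily so every_nth() below works without vidgear installed.
    from vidgear.gears import CamGear

    stream = CamGear(source=url, stream_mode=True).start()
    try:
        while True:
            frame = stream.read()
            if frame is None:  # read() returns None when the stream ends
                break
            yield frame
    finally:
        stream.stop()

def every_nth(frames, step, limit):
    """Keep every `step`-th item from an iterable, up to `limit` items."""
    return list(islice(islice(frames, 0, None, step), limit))
```

Usage would be something like `every_nth(stream_frames(video_url), step=30, limit=8)` to pull a handful of spaced-out frames for a pixel-based or NSFW filter.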

@mtanghu mtanghu closed this as completed Sep 5, 2023
@mtanghu mtanghu reopened this Sep 5, 2023

mtanghu commented Sep 13, 2023

There are also 4 auto-generated thumbnails from YouTube for each video (taken from different parts of the video)

https://stackoverflow.com/questions/2068344/how-do-i-get-a-youtube-video-thumbnail-from-the-youtube-api
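
For reference, a small sketch of building those thumbnail URLs from a video id. It follows the `img.youtube.com/vi/<id>/<n>.jpg` pattern described in the linked Stack Overflow answer, which is not an officially documented API as far as I know, so treat the pattern as an assumption. `auto_thumbnail_urls` is a hypothetical helper name and the id below is made up.

```python
def auto_thumbnail_urls(video_id):
    """Return the four auto-generated thumbnail URLs for a video id.

    Assumes the img.youtube.com pattern from the Stack Overflow answer,
    with images numbered 0.jpg through 3.jpg.
    """
    base = f"https://img.youtube.com/vi/{video_id}"
    return [f"{base}/{i}.jpg" for i in range(4)]

urls = auto_thumbnail_urls("abc123XYZ_0")  # made-up id
```

These could be fetched instead of full frames for a cheap first round of pixel-based filtering.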

@tomohiro-sawada tomohiro-sawada self-assigned this Sep 13, 2023