This is a project outline, and I build on it as I progress.
Product: Streamlined process to create a labeled data set of images and text based on video content on the web, which already has images and spoken words. Create a neural net for images and words.
Are there similar products in the market?
Yes, here are a few that accomplish something similar.
Clarity is open source and labels objects in images or video frames. It was developed by the founders of Clarifai and uses the Clarifai SDK. It is unclear whether Clarifai itself is open source.
Supervisely processes videos and labels.
Benefit: Labeled datasets are needed to train neural nets via supervised learning.
Values related to software development:
Implement existing open source software so that a custom implementation is possible if needed. Keep code clean and easy for open source contribution. This is a nice collection of articles on open-source software architecture.
Points on developing neural net architecture:
For evaluating neural architecture, try the existing successful architectures, such as VGG16, and also evaluate new structures if possible.
The progression of neural net architecture is outlined in review papers. Two main ideas are to increase layers and change the connectivity between layers, meaning some earlier layers connect directly with layers close to the end of a deep net.
Examine the structure and parameters of the neural net using ideas from statistics that are traditionally used to understand systems with some random behavior. The pseudo-random system here is the neural net architecture.
Use both deep learning, where the neural net architecture is harder to interpret, and statistical insights, where the fundamentals are better understood.
Examine ideas in statistical mechanics for understanding processes that have significant randomness, such as Poisson statistics, population growth in Fisher statistics, Boltzmann statistics, and others.
Compare the meaning of weights in a neural net with that of coefficients in a basis-set expansion.
Links to some papers which seek to understand neural net architecture:
https://arxiv.org/pdf/1611.03530.pdf by MIT, Google, and Berkeley
A database will need to be chosen. Redis is one open source option.
Record audio and process it. Audacity is free, open source software that can record audio from YouTube videos. The YouTube API may be needed to automate some tasks, such as getting links for all the videos in a specific category.
Record video frames.
For speech-to-text conversion, there are several open source options. Kaldi and DeepSpeech can be tried.
Eliminate articles and extra speech words, keep only key vocabulary.
Label video frames with key words in speech.
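The two steps above, filtering the speech down to key vocabulary and attaching those words to frames, can be sketched as follows. This is only a minimal illustration under assumptions: the stop-word list is a tiny made-up sample, and the transcript is assumed to arrive as (start, end, text) segments with timestamps in seconds, which the speech-to-text tool may or may not provide in that form. The function names are hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "so", "um", "uh", "is", "are",
              "to", "of", "in", "we", "i", "now", "this", "that", "it"}


def key_words(text):
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def label_frames(frame_times, segments):
    """Attach to each frame the key words spoken while it was on screen.

    frame_times: list of frame timestamps in seconds.
    segments: list of (start_s, end_s, text) tuples from a transcript.
    """
    labels = {}
    for t in frame_times:
        words = []
        for start, end, text in segments:
            if start <= t < end:
                words.extend(key_words(text))
        labels[t] = words
    return labels


# Hypothetical transcript segments, for illustration only.
segments = [
    (0.0, 3.0, "So now we are cutting the onion"),
    (3.0, 6.0, "Um, and shredding the cabbage"),
]
frame_labels = label_frames([1.5, 4.5], segments)
print(frame_labels)
```

With one frame sampled every 3 seconds, each frame picks up the key words of the segment that contains its timestamp. A wider time window around each frame may work better and is something to settle by trial and error.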
Object labeling is also possible with open source software and existing neural nets for image processing and object detection. Tools for this include OpenCV, TensorFlow, Keras, and Caffe, with Google Colab as a hosted environment to run them. The aim here is to use more contextual labels, which are provided in the speech of the video. For example, the word ‘cutting’ might be associated with a ‘knife’, and some foods get associated with ‘cutting’ while others get associated with ‘shredding’.
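One simple way to surface such contextual associations is to count how often a spoken key word occurs in the same frame as a detected object. The sketch below assumes each frame already has a list of detected objects and a list of spoken key words. The data and the function name are hypothetical, and counting co-occurrences is just one possible starting point, not a settled method.

```python
from collections import Counter, defaultdict


def cooccurrence(frames):
    """Count how often each spoken key word appears with each detected object.

    frames: list of (detected_objects, spoken_key_words) pairs, one per frame.
    """
    counts = defaultdict(Counter)
    for objects, words in frames:
        for obj in set(objects):
            # Count each word at most once per frame for this object.
            counts[obj].update(set(words))
    return counts


# Hypothetical per-frame data, for illustration only.
frames = [
    (["knife", "onion"], ["cutting", "onion"]),
    (["knife", "carrot"], ["cutting", "carrot"]),
    (["grater", "cabbage"], ["shredding", "cabbage"]),
]
counts = cooccurrence(frames)
print(counts["knife"].most_common(1))  # [('cutting', 2)]
```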
Method can be implemented with videos in different topic areas, like travel and National Geographic documentaries, or cooking videos.
1. Open source code that reads in a video and creates a labeled dataset
2. Neural net that takes in an image and returns high-probability labels for it or, conversely, takes a set of words and returns a matching image
Image generation can be based on a generative adversarial network (GAN).
Dream use of product:
Neural net provides stylized images for a story, for example, images for a time travel story. Another example is a language translator, with the difference being that it is a word-to-visual translation. The speaker can keep adding words until a satisfactory image appears, similar to playing Pictionary.
An actual labeled dataset cannot be provided because that would violate copyright laws.
Example: Alps, Mountains, 12,000 ft, snow, cold, climbing, challenge
Explore building a pipeline that integrates different open source architectures. When successful, this is very powerful. When not successful, it shows which key areas still need development.
A next step is to build a distributed or concurrent transfer learning pipeline, so that once the NN is built, it can become more accurate as more videos are uploaded on the web.
Working through size and computational requirements
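Audio file size (in bits) = bit rate * duration of audio in seconds * number of channels
where bit rate (per channel) = bit depth * sample rate. Divide by 8 to get bytes.
Typical audio size: about 53 MB for 5 minutes of uncompressed 16-bit, 44.1 kHz stereo
Size also depends on the format (uncompressed .wav vs. compressed mp3). Determine the minimum audio size for preserving speech.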
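The size formula can be checked with a few lines of arithmetic. The sketch below reproduces the 53 MB figure for uncompressed 16-bit, 44.1 kHz stereo and compares it with 16-bit, 16 kHz mono, which is assumed here as a plausible lower-quality setting for speech, not a value determined by this project. The function name is hypothetical, and the small file header is ignored.

```python
def wav_size_bytes(sample_rate_hz, bit_depth, channels, duration_s):
    """Uncompressed audio size in bytes, ignoring the file header."""
    bit_rate_per_channel = bit_depth * sample_rate_hz
    total_bits = bit_rate_per_channel * duration_s * channels
    return total_bits // 8


five_minutes = 5 * 60

# 16-bit, 44.1 kHz stereo.
cd_stereo = wav_size_bytes(44_100, 16, 2, five_minutes)

# Assumed speech-oriented setting: 16-bit, 16 kHz mono.
speech_mono = wav_size_bytes(16_000, 16, 1, five_minutes)

print(cd_stereo / 1e6)    # 52.92 (MB)
print(speech_mono / 1e6)  # 9.6 (MB)
```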
Size of a database that has been successfully used to train neural nets
ImageNet — 14 million images in 20,000 categories
Most videos have 24 frames per second. We can sample at 1 frame per 3 seconds, and then remove the frames that are very similar to each other. By trial and error, we have to decide how many frames to select from each video. For 10 million images at 1 frame per 3 seconds, we need 30 million seconds, or approximately 1 year, of video and audio, and more once similar frames are removed.
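The estimate of roughly one year of video follows directly from the sampling rate, as the short calculation below shows. It also lists which frame indices would be kept from a clip at 24 frames per second. The function names are hypothetical, and the removal of near-duplicate frames is not modeled here.

```python
def video_seconds_needed(target_frames, seconds_per_frame=3):
    """Seconds of video needed to collect target_frames sampled frames."""
    return target_frames * seconds_per_frame


def sampled_frame_indices(total_frames, fps=24, seconds_per_frame=3):
    """Indices of the frames kept when sampling one frame every few seconds."""
    step = fps * seconds_per_frame
    return list(range(0, total_frames, step))


seconds = video_seconds_needed(10_000_000)
days = seconds / 86_400
print(seconds, round(days))            # 30000000 347

# A 10-second clip at 24 frames per second has 240 frames.
print(sampled_frame_indices(24 * 10))  # [0, 72, 144, 216]
```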
Audio Analysis, Statistical tools for time-series data
Characteristics of time-series data (Wang, Smith, Hyndman in Data Mining and Knowledge Discovery)
· Trend — a long-term change in the mean
· Seasonality — large autocorrelation coefficient at the seasonal lag
· Serial correlation
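The characteristics listed above can each be estimated with simple statistics. The sketch below is one possible reading, not the method of the cited paper: trend is measured as a least-squares slope against the time index, seasonality as the autocorrelation coefficient at an assumed seasonal lag, and serial correlation as the autocorrelation at lag 1. The series is synthetic and the function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def autocorrelation(xs, lag):
    """Sample autocorrelation coefficient of xs at the given lag."""
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    return num / denom


def trend_slope(xs):
    """Least-squares slope of xs against its index, a simple trend measure."""
    n = len(xs)
    t_mean = (n - 1) / 2
    x_mean = mean(xs)
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(xs))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


# Synthetic series with a repeating pattern of period 4 and no trend.
series = [0, 1, 0, -1] * 6

print(autocorrelation(series, 4))  # large and positive at the seasonal lag
print(autocorrelation(series, 2))  # large and negative at half the period
print(autocorrelation(series, 1))  # lag-1 value, a serial correlation measure
print(trend_slope(series))         # close to zero, since there is no trend
```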