Nostrum 2 is a minimal music player focused on achieving the highest playback quality available. It uses WASAPI, the low-level Windows audio API, and performs all internal audio processing at 32-bit resolution.
Esc - Exit
Ctrl + left click - Move the window
Ctrl + right click - Resize the window
W - Shuffle
S - Toggle playback
X - Toggle the waveform visualizer
Q - Reduce volume
E - Increase volume
A - Reverse
D - Fast forward
Z - Previous track
C - Next track
Nostrum was designed to put audio quality first. For example, it uses exclusive-mode playback, which prevents other programs from playing audio and gives Nostrum bit-perfect output without the mixing Windows would otherwise apply. For the same reason, I decided to implement only FLAC file loading. Nostrum for Android and Nostrum 3 support many different formats, but Nostrum 2 follows a more minimalist approach.
Traditionally, players decode as they play, discarding data that has already been played and keeping a buffer between the playback position and what has been loaded. However, RAM is not much of a constraint these days, and a decompressed track is only around 100 MB. Loading the whole track at once is simpler and avoids the issues that can arise from maintaining a playback buffer.
Because only FLAC is supported, the FLAC library can be used directly to load files. Decoding with the library is straightforward: create a decoder instance, then initialize it with the desired file, a write callback, a metadata callback, an error callback, and a client data pointer. The metadata callback is called first, providing information such as the number of audio channels, the sample rate, the bits per sample, and the total number of samples in the track.
A sample is simply a numerical value describing the amplitude of the sound wave at a given instant. The bits per sample determine how much precision each sample has. For example, 16 bits per sample is common, meaning each sample can range from −32,768 to 32,767 (65,536 possible values). Some tracks even use 32 bits per sample (4,294,967,296 possible values). Each audio channel is a separate waveform, so each has its own set of samples. The sample rate describes how many samples there are per second; 44,100 is a common rate. As you might imagine, the more samples and the more precision per sample, the more accurate the representation of the audio waveform. However, anything beyond 16-bit samples at 44,100 per second is generally considered above what human hearing can distinguish.
Below is a visualization of an audio waveform from a 2-channel, 16-bit, 44,100 Hz track, zoomed in to show a section about 0.003 seconds long. The dots throughout the waveforms each represent a sample. As you can see, the waveform is interpolated between these samples, so the reconstructed curve is smoother and more continuous than the samples alone suggest. Still, the interpolation can be inaccurate if there are too few samples to work from, or if it is done poorly. The interpolation of the samples into an analog waveform is performed by a device called a digital-to-analog converter (DAC). The resulting waveform then travels from your DAC to your speakers.
After the metadata is received, the file is decoded in frames (groups of samples), and the write callback is called each time a frame is ready. The callback provides the number of samples in the frame along with a buffer containing them. That buffer can then be copied into another array that accumulates the entire track.
To play the loaded data, we need an audio API so that the application can interface with the OS, which in turn talks to the DAC. The Windows Audio Session API (WASAPI) provides the most control over playback on Windows, although, like most Windows APIs, it is a bit of a pain to work with.
The default audio device is retrieved and activated. Then the desired playback format is tested against the device to make sure it is supported. An audio client is retrieved from the device and initialized to use a raw, exclusive stream at the device's minimum latency. If the requested buffer size is not aligned to the device period, the audio client must be recreated with an aligned duration. Finally, an audio render client is retrieved from the audio client, and an event object is bound to it.
The playback thread is then given the "Pro Audio" MMCSS (Multimedia Class Scheduler Service) thread characteristic and a high MMCSS thread priority. The first buffer is filled with silence. In a loop, the thread waits on the previously mentioned event object, which signals that the audio render client needs its buffer refilled with track audio. When the event is received, the buffer is filled with the appropriate number of samples and then released to be sent to the DAC. The process repeats until there are no samples left in the track.