Deduplication of uploaded files
After some discussion and thought, here are my findings on the subject:
- Deduplication would be good for storage
- Deduplication would require a model change: store the audio files themselves in a separate table, bind each audio file to a track, and store the user_library <-> audiofile relation (see the model sketch after this list)
- Deduplication would impact quota management
- Proper deduplication cannot rely on a hash of the whole file, because any change in the audio headers would result in a different hash, even though the audio itself remains the same
- We can get a better hash by stripping all metadata from the file and computing the hash on the remaining part. This way, if the headers change but the audio stream does not, the hash remains the same
- If we plan to have a client that can skip uploading files that are already available on the server, while still marking the track as available in a user's library, we have to be careful:
  - Here again, we cannot rely on the file hash only: someone could publicly share a list of popular hashes, and a user could simulate an upload with those hashes, without having access to the actual files
  - Using only our better hash is not enough, for the same reason
  - We need to think about this in terms of a challenge: we need a way to prove that the user has access to the file before skipping the upload. I think this is doable by computing a challenge result on the server side, for instance storing a hash of (metadata-stripped content + instance_name), and ensuring the client can submit the same thing.
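To give a rough idea of that model change, here is a sketch using Django-style models (Funkwhale's stack); all names and fields are illustrative, not a final schema:

from django.db import models

class AudioFile(models.Model):
    # one row per unique audio stream, shared across user libraries
    audio = models.FileField(upload_to='audio/')
    size = models.PositiveIntegerField()
    # hash of the file with metadata stripped (see the PoC below)
    stripped_hash = models.CharField(max_length=40, unique=True)
    track = models.ForeignKey('music.Track', on_delete=models.CASCADE)

class LibraryEntry(models.Model):
    # the user_library <-> audiofile relation: marks the file as available
    # in a library without duplicating storage
    library = models.ForeignKey('music.Library', on_delete=models.CASCADE)
    audio_file = models.ForeignKey(AudioFile, on_delete=models.CASCADE)

    class Meta:
        unique_together = ('library', 'audio_file')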
PoC using ffmpeg-python
There are non-technical considerations as well in the next parts, so don't hesitate to skip this one.
import subprocess

import ffmpeg

audio_file_path = 'test.mp3'
with open(audio_file_path, 'rb') as f:
    original_file_content = f.read()

# Based on https://github.com/kkroening/ffmpeg-python/issues/49: we pipe the
# file through ffmpeg to avoid writing temporary files to disk
strip_tags_command = ffmpeg.input('pipe:').output(
    'pipe:', **{'map_metadata': '-1', 'vcodec': 'copy', 'acodec': 'copy', 'format': 'mp3'})
strip_tag_process = subprocess.Popen(
    ['ffmpeg'] + strip_tags_command.get_args(),
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
no_tag_file_content, err = strip_tag_process.communicate(input=original_file_content)

# tags were stripped, so the contents do not match
assert no_tag_file_content != original_file_content

# now, open the original file in MusicBrainz Picard, add a new tag like
# "Hello world", then load its content again
with open(audio_file_path, 'rb') as f:
    new_tag_file_content = f.read()

# a new tag was added, so the new and original contents do not match
assert original_file_content != new_tag_file_content

# we strip the tags from the new file content and ensure the result matches
# our previous attempt
strip_tag_process = subprocess.Popen(
    ['ffmpeg'] + strip_tags_command.get_args(),
    stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = strip_tag_process.communicate(input=new_tag_file_content)
assert out == no_tag_file_content
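For reuse in the server-side checks below, the stripping step can be wrapped in a small helper; this is just a consolidation of the snippet above, and the function name is mine:

def strip_tags(content):
    # return the MP3 audio stream with all metadata removed; codecs are
    # copied, so no re-encoding happens
    command = ffmpeg.input('pipe:').output(
        'pipe:', **{'map_metadata': '-1', 'vcodec': 'copy', 'acodec': 'copy', 'format': 'mp3'})
    process = subprocess.Popen(
        ['ffmpeg'] + command.get_args(), stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = process.communicate(input=content)
    return out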
The previous snippet shows that we can reliably strip metadata from an audio file using ffmpeg, and use the result to compute a more reliable hash of uploaded files. On the server side, we could store this hash, along with a challenge based on it:
import hashlib

instance_name = b'demo.funkwhale.audio'
uploaded_file_hash = hashlib.sha1(no_tag_file_content).hexdigest()
# binding the challenge to the instance name means a public list of hashes
# computed for another instance is useless here
challenge_body = no_tag_file_content + instance_name
challenge = hashlib.sha1(challenge_body).hexdigest()
This challenge is simple for clients to compute (since it only uses the instance name, or other public, instance-specific info), and does not require extra resources on the server (we compute it once and we're good). It makes it difficult to exploit a list of music file hashes found in the wild, since an attacker would need a challenge value for each Funkwhale instance. However, it's neither perfect nor future proof.
A more involved approach could use a dynamic part in the challenge body. Instead of sending a hash of the file content with the instance name appended, the client would do something like this:
import datetime
import hashlib

import requests

now = datetime.datetime.now().isoformat()
local_file_hash = hashlib.sha1(no_tag_file_content).hexdigest()
# the timestamp acts as a dynamic input, so a precomputed result cannot be replayed
challenge_body = no_tag_file_content + now.encode('utf-8')
challenge = hashlib.sha1(challenge_body).hexdigest()
requests.post('https://demo.funkwhale.audio/upload/check/',
              data={'challenge_input': now, 'challenge_result': challenge, 'file_hash': local_file_hash})
When receiving that, the server could perform the same steps and match the result against the client-specified value. However, this would require reading the file for each challenge, resulting in heavier resource usage.
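For illustration, here is a minimal sketch of that server-side check, reusing the strip_tags() helper from above; find_file_by_hash() is a hypothetical storage lookup:

def verify_challenge(challenge_input, challenge_result, file_hash):
    stored_file = find_file_by_hash(file_hash)  # hypothetical storage lookup
    if stored_file is None:
        return False
    # the costly part: the stored file must be read and stripped for every challenge
    no_tag_content = strip_tags(stored_file.read())
    expected = hashlib.sha1(no_tag_content + challenge_input.encode('utf-8')).hexdigest()
    return expected == challenge_result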
Quota and deduplication
One thing to consider when we deduplicate is how we handle user quotas. At the moment, it's pretty simple: if a user has a 1GB quota and uploads 100MB of files, their remaining quota is 900MB. This makes less sense with deduplication, since the same file, uploaded to ten libraries, results in only one file actually being stored on the server. How do we handle this case when we compute user quotas? I can see at least three possibilities:
- We don't do anything: if 10 users upload the same 10MB track, each one will have 10MB counted against their quota. Pros: simple to manage. Cons: less storage available to users, and a bit awkward since it doesn't match our underlying storage model
- The first uploader "wins": if 10 users upload the same 10MB track, only the first one will see their quota used. Pros: more storage available for the other users. Cons: not really nice for the first user, and what happens if they delete the track from their library?
- Divide usage: if 10 users upload the same 10MB track, each one will have 1MB counted against their quota. Pros: more storage available for everyone. Cons: more intensive on the database, and less visibility for users on their own quota, since it can go up and down based on other users' activity (see the sketch after the next paragraph)
I like option 3 the most, even if it's more intensive, because it's fair to everyone. New users can upload existing files while using only a fraction of the otherwise required quota, and users uploading popular tracks will get some of their quota back, allowing them to upload more stuff. It's more collaborative and less individualistic than the other options.
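To make option 3 concrete, here is a minimal sketch of how the divided usage could be computed, assuming we can count how many libraries reference each stored file (names are illustrative):

def quota_usage(user_files):
    # user_files: list of (size_in_bytes, number_of_libraries_referencing_the_file)
    return sum(size / references for size, references in user_files)

# 10 users sharing the same 10MB track are each charged 1MB
assert quota_usage([(10_000_000, 10)]) == 1_000_000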
Conclusion
Let me know what you think about those ideas, and feel free to suggest new ones and point out errors :)
Regardless of all that, I think deduplication should wait a bit before being implemented, as we'll probably have a lot to work on once we get feedback about the 0.17 release, so we can take the time to figure this out properly.