
Combining multiple files into modifiable archive?

16 comments, last by Shaarigan 4 years, 2 months ago

Hi,

This might be a stupid question, but does anyone know a good solution to combine multiple files into one big file, like an archive, but modifiable?

I have a server that reads and writes a lot of files (for the savegame) and I want to combine them into one big file with compression to save space. Before I write my own implementation I wanted to ask if anyone knows an existing solution. I found some libraries for archives but all of them required me to create a new archive whenever I wanted to modify something…


Technically, you cannot insert or remove bytes arbitrarily in the middle of a file. You can append content, overwrite content in place, or throw away everything after position X and replace it with new content; but when you need to replace some item inside an archive with another item that is larger or smaller, you have to recreate the archive anyway. A library might hide that fact from you, or you could do the operation as long as the entire archive is in memory (and only write the changes to disk), but that's probably the reason why all the libraries you found require you to recreate the archive.
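To illustrate, a minimal sketch (names are illustrative) of the two operations a plain file does support, and the one it does not:

```cpp
#include <fstream>
#include <string>

// Overwrite bytes in place at a known offset. The file length is unchanged
// as long as the new data is not longer than what it replaces.
void overwrite_at(const std::string& path, std::streamoff offset, const std::string& data)
{
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    f.seekp(offset);
    f.write(data.data(), static_cast<std::streamsize>(data.size()));
}

// Append bytes at the end of the file.
void append(const std::string& path, const std::string& data)
{
    std::ofstream f(path, std::ios::app | std::ios::binary);
    f.write(data.data(), static_cast<std::streamsize>(data.size()));
}

// What a plain file cannot do: insert or delete bytes in the middle.
// Replacing an entry with a larger or smaller one means rewriting
// everything from that point to the end of the file.
```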

Other than implementing a complete virtual disk at the block level with a filesystem on top (something more in the territory of virtual machine hypervisors), a lesser-used feature of zip comes to mind.

In a zip file, each file entry is compressed separately, and the file index (the central directory) is written at the end, which has some other advantages. This allows a zip file to be amended by appending new entries and writing a new index; an entry can be "deleted" by simply not including it in the new index. In theory new files could overwrite the space used by old ones, but you would need to be sure they fit, as there is no "fragmentation" support.
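To make the "index at the end" point concrete, here is a minimal sketch (field names are mine) of the end-of-central-directory record that a zip reader locates by scanning backwards from the end of the file:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// End-of-central-directory (EOCD) record of a .zip file (fixed part, 22 bytes).
// Fields are little-endian and unaligned in the file, so read them byte-wise
// rather than casting the raw buffer to this struct.
struct ZipEocd {
    uint32_t signature;           // 0x06054b50 ("PK\x05\x06")
    uint16_t disk_number;
    uint16_t central_dir_disk;
    uint16_t entries_on_disk;
    uint16_t total_entries;
    uint32_t central_dir_size;    // size of the central directory in bytes
    uint32_t central_dir_offset;  // offset of the central directory from the file start
    uint16_t comment_length;
};

// Scan the tail of the file backwards for the EOCD signature. `tail` should be
// the last ~64 KB of the archive (the zip comment can be up to 65535 bytes).
// Returns the offset of the signature within `tail`, or -1 if not found.
long find_eocd(const std::vector<uint8_t>& tail)
{
    static const uint8_t sig[4] = { 0x50, 0x4b, 0x05, 0x06 }; // "PK\x05\x06"
    if (tail.size() < 22) return -1;
    for (long i = static_cast<long>(tail.size()) - 22; i >= 0; --i) {
        if (std::memcmp(&tail[i], sig, 4) == 0)
            return i;
    }
    return -1;
}
```

Amending an archive in the way described above then roughly amounts to: seek to central_dir_offset, write the new entries (local header plus compressed data) over the old central directory, and then write a fresh central directory and EOCD after them.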

Typically in a case like this, though, I can't see why the client couldn't recreate the entire archive/save file from scratch. You may still want a format such as zip: you can choose not to double-compress, say, a JPG or PNG preview image, and such a preview image or other summary data file can be read without decompressing potentially the entire thing (as you would have to with a compressed tar archive, e.g. .tar.gz), for example for a savegame listing.
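As an illustration of that idea, here is a sketch using libzip (mentioned later in the thread); the entry names preview.png and world.dat are hypothetical, and error handling is kept minimal:

```cpp
#include <zip.h>   // libzip
#include <cstdint>
#include <vector>

// Write a savegame archive where the PNG preview is stored uncompressed
// (it is already compressed) while the other entry uses deflate.
bool write_save(const char* path,
                const std::vector<uint8_t>& preview_png,
                const std::vector<uint8_t>& world_data)
{
    int err = 0;
    zip_t* za = zip_open(path, ZIP_CREATE | ZIP_TRUNCATE, &err);
    if (!za) return false;

    // Note: with freep = 0, the buffers must stay alive until zip_close(),
    // because libzip only writes the archive out at that point.
    zip_source_t* preview_src =
        zip_source_buffer(za, preview_png.data(), preview_png.size(), 0);
    zip_int64_t preview_idx = zip_file_add(za, "preview.png", preview_src, ZIP_FL_ENC_UTF_8);
    if (preview_idx < 0) { zip_discard(za); return false; }
    zip_set_file_compression(za, static_cast<zip_uint64_t>(preview_idx), ZIP_CM_STORE, 0);

    zip_source_t* world_src =
        zip_source_buffer(za, world_data.data(), world_data.size(), 0);
    zip_int64_t world_idx = zip_file_add(za, "world.dat", world_src, ZIP_FL_ENC_UTF_8);
    if (world_idx < 0) { zip_discard(za); return false; }
    zip_set_file_compression(za, static_cast<zip_uint64_t>(world_idx), ZIP_CM_DEFLATE, 0);

    return zip_close(za) == 0; // everything is actually written here
}
```

A reader that only wants the savegame listing can then extract just preview.png (stored, so no inflate cost) without touching the rest of the archive.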

Compression is a problem because usually you want to compress the whole file to get the best possible ratio. Depending on the algorithm you choose, changing one entry then forces all entries beyond the changed one to be recompressed; otherwise you need to compress each entry on its own (which may not be optimal at all, depending on the data).

So this is quite difficult.
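To illustrate the trade-off above, a sketch using zlib's one-shot API that compares per-entry compression with compressing everything as one stream (names are illustrative):

```cpp
#include <zlib.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Deflate one buffer on its own with zlib's one-shot API.
std::vector<uint8_t> deflate_buffer(const std::vector<uint8_t>& input,
                                    int level = Z_BEST_COMPRESSION)
{
    uLongf out_size = compressBound(static_cast<uLong>(input.size()));
    std::vector<uint8_t> output(out_size);
    compress2(output.data(), &out_size, input.data(),
              static_cast<uLong>(input.size()), level);
    output.resize(out_size);
    return output;
}

// Per-entry compression (what zip does): each entry can be replaced on its own,
// but redundancy *between* entries is not exploited.
size_t per_entry_size(const std::vector<std::vector<uint8_t>>& entries)
{
    size_t total = 0;
    for (const auto& e : entries) total += deflate_buffer(e).size();
    return total;
}

// Whole-stream ("solid") compression, as with .tar.gz: usually smaller, but
// changing any entry forces recompressing everything from that point on.
size_t solid_size(const std::vector<std::vector<uint8_t>>& entries)
{
    std::vector<uint8_t> all;
    for (const auto& e : entries) all.insert(all.end(), e.begin(), e.end());
    return deflate_buffer(all).size();
}
```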

There are, however, data structures and algorithms that can handle dynamic add/modify/remove operations and also manage the space properly, but they are usually found in database programming. So, TL;DR: you have to implement your own database to achieve what you are asking for.

I already wrote a single-file database once, for a game design tool. In general, you have to maintain some kind of indexing header that knows where in your file the data is. Then you need some kind of free-space table to determine where to put new data, or where there is space left to extend existing data. I used memory-mapped files and 64 KB chunking (a single chunk was 64 KB and could contain N instances of either an indexing, free-space, or data table, which had different sizes like 2 KB, 4 KB, 8 KB), and the data tables were also chunked internally into 64-byte blocks.

When I had to write data, I got the first data table chunk that still had at least one block left to fill. Its address was then written into my index table (a B+ tree), and a data header was created at that block with some meta information, pointing to the first block that contained the data. I did the same for as long as there were bytes left, until all the data fit into a chain of 64-byte blocks. I didn't care about fragmentation at all here, so the data was saved as a linked list, probably spread across different data tables and table chunks.
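To make that concrete, here is a rough sketch of that kind of on-disk layout as structs; the exact sizes and field names are my guesses, not the original tool's format:

```cpp
#include <cstdint>

// The file is divided into 64 KB chunks; a chunk holds index pages, free-space
// pages, or data pages; data pages are split into 64-byte blocks that are
// chained into a linked list per entry.

constexpr uint32_t kChunkSize = 64 * 1024;
constexpr uint32_t kBlockSize = 64;

enum class ChunkKind : uint16_t { Index = 0, FreeSpace = 1, Data = 2 };

struct ChunkHeader {
    ChunkKind kind;
    uint16_t  reserved;
    uint32_t  used_blocks;   // how many 64-byte blocks are occupied (data chunks)
};

// A 64-byte data block: a small link header plus payload, chained to the next block.
struct DataBlock {
    uint64_t next_block;     // absolute file offset of the next block, 0 = end of chain
    uint8_t  payload[kBlockSize - sizeof(uint64_t)];
};
static_assert(sizeof(DataBlock) == kBlockSize, "data block must be exactly 64 bytes");

// Entry header referenced by the index; the index itself maps an entry id (or
// name hash) to this. The post above used a B+ tree for the index; a simple
// in-memory map rebuilt on load works as a starting point.
struct EntryHeader {
    uint64_t first_block;    // file offset of the first data block in the chain
    uint64_t total_size;     // payload size in bytes
};
```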

Well, I guess I have to implement my own system then. Recreating a zip with 1000s of files just takes too long. I have already written a database system; I could theoretically use it to store the files and then just compress it after closing the file. This should be faster than writing 1000s of files individually.

Edit: Apparently I had the same idea as Shaarigan. My data is stored differently: single entries are always kept in one block with some extra space in case they are modified and grow. If they no longer fit, the whole entry is moved to a free block that is big enough. I have to optimize this a bit, though, I think.
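A minimal in-memory sketch of that scheme (names are illustrative; the real thing would persist the index and operate on the file directly):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BlockRef {
    uint64_t offset   = 0;  // where the entry's block starts in the archive
    uint32_t capacity = 0;  // bytes reserved for the entry (includes slack)
    uint32_t size     = 0;  // bytes actually used
};

class BlockStore {
public:
    // Write or update an entry. If the new data still fits into the block's
    // capacity, overwrite in place; otherwise move the entry to a block that
    // is big enough and remember the old one as free space.
    void put(uint64_t id, const std::vector<uint8_t>& data)
    {
        auto it = entries_.find(id);
        if (it != entries_.end() && data.size() <= it->second.capacity) {
            write_at(it->second.offset, data);                 // in-place update
            it->second.size = static_cast<uint32_t>(data.size());
            return;
        }
        if (it != entries_.end())
            free_blocks_.push_back(it->second);                // old block becomes reusable

        BlockRef ref = allocate(static_cast<uint32_t>(data.size()));
        ref.size = static_cast<uint32_t>(data.size());
        write_at(ref.offset, data);
        entries_[id] = ref;
    }

private:
    // First fit over the free list; otherwise append a new block with ~25%
    // slack so the entry can grow a little before it has to move again.
    // (No splitting of oversized free blocks in this sketch.)
    BlockRef allocate(uint32_t needed)
    {
        for (size_t i = 0; i < free_blocks_.size(); ++i) {
            if (free_blocks_[i].capacity >= needed) {
                BlockRef ref = free_blocks_[i];
                free_blocks_.erase(free_blocks_.begin() + static_cast<std::ptrdiff_t>(i));
                return ref;
            }
        }
        BlockRef ref;
        ref.capacity = needed + needed / 4;
        ref.offset   = file_.size();
        file_.resize(file_.size() + ref.capacity);
        return ref;
    }

    void write_at(uint64_t offset, const std::vector<uint8_t>& data)
    {
        std::copy(data.begin(), data.end(),
                  file_.begin() + static_cast<std::ptrdiff_t>(offset));
    }

    std::unordered_map<uint64_t, BlockRef> entries_;  // persisted as an index in the real thing
    std::vector<BlockRef>                  free_blocks_;
    std::vector<uint8_t>                   file_;     // stand-in for the archive file on disk
};
```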

Beosar said:
Well, I guess I have to implement my own system then. Recreating a zip with 1000s of files just takes too long.

Each save is 1000s of files, the majority of which do not change? That seems a very odd save system; you might want to look at how that itself works.

Even with 1000s of entries, assuming the total size is fairly small (since you do want to upload it), building a zip archive from scratch should be very fast, even without the workarounds to "update" one as I described. You are limited by your disk's sequential write speed for the post-compression output (generally over 100 MB/s even for an HDD), and deflate/inflate should run at many times that on a modern CPU thread.
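If you want to sanity-check that on your own hardware, a rough sketch with zlib's one-shot API (the filler data here is synthetic, so real savegame data will compress and perform differently):

```cpp
#include <zlib.h>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // 64 MB of semi-compressible filler data.
    std::vector<uint8_t> input(64 * 1024 * 1024);
    for (size_t i = 0; i < input.size(); ++i)
        input[i] = static_cast<uint8_t>((i * 31) ^ (i >> 7));

    uLongf out_size = compressBound(static_cast<uLong>(input.size()));
    std::vector<uint8_t> output(out_size);

    auto t0 = std::chrono::steady_clock::now();
    compress2(output.data(), &out_size, input.data(),
              static_cast<uLong>(input.size()), Z_DEFAULT_COMPRESSION);
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("deflated %zu MB -> %lu bytes in %.2f s (%.0f MB/s)\n",
                input.size() >> 20, static_cast<unsigned long>(out_size),
                seconds, (input.size() / (1024.0 * 1024.0)) / seconds);
}
```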

So lately, I've been working on reverse-engineering the PSP version of Final Fantasy IV to scrape its assets, so I can use them in a FF4 clone that a friend and I are writing. The PSP ISO contains 2 large .bin files: one that's like 330MB, and another that's 120KB. The larger file is literally an archive containing the entire game's assets while the smaller file serves as a directory. All of those individual file entries have LZS compression on top of them IIRC, and then a good chunk of all those tiny images have LZSS compression on top of that…

Anyway, as cool as that sounds, I'm still trying to figure out why they'd do something like that when .zip exists. Does .zip have some sort of royalty licensing attached to it like the .mp3 format has? I have no idea…

But yeah, creating a .zip archive could solve that. I don't have any clue as to what kind of performance/overhead impacts that it'll have though.

Currently learning Rust

A few things come to mind where a simple non-standard solution might be better, especially with enough work-hours behind it to make the utility tools.

  • Compression ratio of LZS vs deflate (couldn't find a comparison quickly). Although they could have used a different compression with the zip format.
  • With zip having the index at the end, this is convenient for easy writing (no need to precalculate/buffer the output or reserve space for it) and updating, but not for reading, as you have to go to the end and start looking backwards for the start of the index. You mentioned an ISO, but how does the real thing store the data? Can it access the index "file" directly, e.g. via its own memory chip or a known offset?
  • The zip format duplicates some data in “local headers” for each file, along with some other potentially unneeded per-entry stuff (about 30 bytes iirc). You could save a small amount of space.

I don't think a lot of that really matters on modern PC hardware versus the cost of doing it for a single application.

@SyncViews Although I built and queried .zip archives via libzip years ago, I've never actually looked at the file spec itself. I do find it interesting that the ZIP format puts the index information last instead of at the beginning of the file, since read scenarios will be at least as common as write scenarios, if not more common.

After a bit more thought, storing the index data at the end is a lot more convenient than I initially thought. Each time the index has to be updated, the ZIP archive only needs to move the index down to make room for new data, which could be arbitrarily large, whereas the index is usually pretty small and therefore far less taxing to shift as the archive grows in files/size. The same wouldn't be true if the index were towards the top, I suppose: if your ZIP archive is already 200 MB of files, then you'd have to move all 200 MB of data down in the binary package to make room for each new entry.

Are there any known archive formats that can be written to in parallel? Maybe that could help with the large file sizes. I'm not sure how (or if) that could work in a practical sense.

Currently learning Rust

I think I read years ago that the design accounts for media such as tape or optical disks that are write-once, or at least very slow to go back to the start, along with the complexity of figuring out how much space must be left on such systems. I believe the format also allows for spanning multiple physical disks as a multi-part archive.

travistrue said:

I do find it interesting that the ZIP format puts the index information last instead of at the beginning of the file, since read scenarios will be at least as common as write scenarios, if not more common.

Even if reading is expected to be common, I think it's really just a slightly more complex bit of programming to read (probably more than made up for by the easier programming to write). Other than maybe on tape (where I guess you lose half the time with any format), I don't think reading blocks from the end has a significant penalty, and in practice most of the time will be spent reading and decompressing the actual entries.

As for the local headers, I am less sure about the reason. One thing that comes to mind is that they allow a good degree of recovery of otherwise intact data if the central directory at the end is lost or damaged.

travistrue said:

Are there any known archive formats that can be written to in parallel? Maybe that could help with the large file sizes. I'm not sure how (or if) that could work in a practical sense.

I can't think of one. It's the sort of thing filesystems and databases do, generally by having a lot of extra space to play with, since they don't try to compact everything as much as possible. If you write blocks from multiple "streams" as they come, you might get good "sequential" write performance, but you will horribly fragment the contents, which would be a massive impact on reading single files back out, and would at best make efficiently extracting the entire archive difficult.

More practically, there are compression algorithms that can use multiple threads, or you could make use of today's large amounts of RAM. Maybe something aimed only at SSDs would be possible, due to the lower seek penalties and the ability to buffer multiple commands, but I've never seen anything.
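For example, since zip-style archives compress each entry independently, one option (a sketch, not any particular library's API) is to deflate the entries on several threads in memory and still write the archive out with a single sequential writer:

```cpp
#include <zlib.h>
#include <cstdint>
#include <future>
#include <vector>

// Same one-shot zlib helper as in the earlier sketch.
static std::vector<uint8_t> deflate_buffer(const std::vector<uint8_t>& input)
{
    uLongf out_size = compressBound(static_cast<uLong>(input.size()));
    std::vector<uint8_t> output(out_size);
    compress2(output.data(), &out_size, input.data(),
              static_cast<uLong>(input.size()), Z_DEFAULT_COMPRESSION);
    output.resize(out_size);
    return output;
}

// Compress all entries in parallel; the caller then writes the results
// sequentially (entry data followed by the index/central directory), so the
// file itself is still produced by a single writer.
// Note: std::launch::async spawns a thread per job; for thousands of entries
// a thread pool or batching would be better in practice.
std::vector<std::vector<uint8_t>>
compress_entries_parallel(const std::vector<std::vector<uint8_t>>& entries)
{
    std::vector<std::future<std::vector<uint8_t>>> jobs;
    jobs.reserve(entries.size());
    for (const auto& e : entries)
        jobs.push_back(std::async(std::launch::async, deflate_buffer, std::cref(e)));

    std::vector<std::vector<uint8_t>> compressed;
    compressed.reserve(entries.size());
    for (auto& j : jobs)
        compressed.push_back(j.get());
    return compressed;
}
```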

This topic is closed to new replies.
