Reading a whole file at once

Most of the programs that I write are filter-like utilities: the program starts up, processes data as it goes, then ends. Usually, these programs don’t have to store a lot of data in memory. If I can find a way to read data only a character or a block at a time, I’ll do it that way. This is probably because I learned C programming on DOS, where memory is usually very limited. From the start, my programming practice has been to load only what I need into memory.

Recently, I’ve started working on a project that requires loading the contents of a data file into memory, and working with the copy in memory. The data file is a text file created by the user, so it could be a few lines long, or a hundred lines. You can approach this using two methods, depending on the system. Because this kind of programming problem comes up all the time in larger projects, I wanted to share an example of how to load a file into memory all at once.

The traditional method

One classic way to load a complete file is to get the size of the file, allocate enough memory to store it, then read the file into memory. At a high level, this requires four steps:

  1. open() to open the file
  2. filelength() or fstat() to get the size of the file in bytes
  3. calloc() to allocate the memory
  4. read() to read the file into memory

Let’s say I wanted to load the contents of a file into an array so I could work on it. I might create a function called open_file() that reads a file into memory; for the purposes of keeping my demonstration a simple one, let’s only deal with one file at a time, and load the contents into a global string array called Fdata, of size Fsize:

char *Fdata;
size_t Fsize;

size_t open_file(const char *filename)
{
    int fd;
    int nread;

    fd = open(filename, O_RDONLY);
    if (fd < 0) {
        /* open failed, so there is no descriptor to close */
        puts("cannot open file");
        return 0;
    }

    Fsize = filelength(fd);

    Fdata = calloc(sizeof(char), Fsize);
    if (Fdata == NULL) {
        puts("out of memory");
        close(fd);
        return 0;
    }

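    nread = read(fd, Fdata, Fsize);
    if (nread < 0) {
        /* read failed; don't store a negative count in Fsize */
        puts("cannot read file");
        free(Fdata);
        close(fd);
        return 0;
    }
    Fsize = nread;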

    close(fd);

    return Fsize;
}

This function takes a single argument: the name of a file to load into memory. It uses fd = open(filename, O_RDONLY) to open the file in read-only mode, and stores the file descriptor as fd. I wrote this sample program on DOS, so I used Fsize = filelength(fd) to get the size of the file in bytes, and store it in Fsize. Note that filelength() is not part of standard C or POSIX; it’s an extension provided by DOS compilers. On Linux, you might use the fstat() function to get the file statistics, including the file size.
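As a rough sketch of the same steps on Linux, assuming a POSIX system: fstat() stands in for filelength(), and read() is called in a loop because a single call may return fewer bytes than requested. The name load_file() and its size out-parameter are hypothetical, chosen for illustration; they are not part of the program above.

```c
#include <assert.h>
#include <fcntl.h>      /* open, O_RDONLY */
#include <stdlib.h>     /* malloc, free */
#include <sys/stat.h>   /* fstat, struct stat */
#include <unistd.h>     /* read, close */

/* Hypothetical helper: returns a malloc'd buffer holding the file and
   stores the byte count in *size. Returns NULL on failure. */
char *load_file(const char *filename, size_t *size)
{
    struct stat inf;
    char *buf;
    size_t want, total = 0;
    int fd;

    assert(filename != NULL && size != NULL);

    fd = open(filename, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }

    if (fstat(fd, &inf) != 0 || inf.st_size < 0) {
        close(fd);
        return NULL;
    }
    want = (size_t) inf.st_size;

    /* one extra byte so the buffer can always be NUL-terminated */
    buf = malloc(want + 1);
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop */
    while (total < want) {
        ssize_t n = read(fd, buf + total, want - total);
        if (n < 0) {
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0) {
            break;              /* file was shorter than fstat reported */
        }
        total += (size_t) n;
    }

    buf[total] = '\0';
    close(fd);
    *size = total;
    return buf;
}
```

A caller would pair the returned pointer with the stored size, and call free() on the pointer when done.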

The function uses Fdata = calloc(sizeof(char), Fsize) to allocate enough memory in Fdata to store the full contents of the file, then calls nread = read(fd, Fdata, Fsize) to read the first Fsize bytes (the full contents of the file) into the Fdata array; this also saves the number of bytes read from the file in nread.

When working with text files on a DOS system, nread will usually be less than Fsize. DOS compilers typically open files in text mode, so each Carriage Return + New Line pair that DOS uses for line endings gets converted from \r\n to just \n as the data is read, and the count in nread shrinks by one byte per line ending. So this method actually allocates a little more memory than needed, but that’s often an acceptable tradeoff.
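To make the size difference concrete, here is a small sketch that mimics the \r\n to \n conversion on a buffer that is already in memory. It only illustrates the effect; it is not how the DOS C library implements text mode, and the function name crlf_to_lf() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>     /* size_t */

/* Hypothetical helper: collapse each "\r\n" pair in buf to "\n", in place.
   Returns the new length, which is never larger than len. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);

    for (in = 0; in < len; in++) {
        /* drop a Carriage Return only when a New Line follows it */
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue;
        }
        buf[out++] = buf[in];
    }

    return out;
}
```

A 10-byte buffer holding two lines that each end in \r\n shrinks to 8 bytes, one byte less per line ending.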

What’s great about this method is that it stores the full contents of a file into a char array, which is just a giant string that’s big enough to keep the entire file in memory. After loading the file, you can access the Fdata string as you would any array. Just don’t forget to release the memory when the program is done with it.
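One caveat when treating the buffer as a string: if read() fills every allocated byte, there is no terminating NUL, so it is safer to always pair the data with its size. As a minimal sketch, a line counter might walk the buffer by length rather than rely on a terminator. The helper name count_lines() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>     /* size_t */
#include <string.h>     /* memchr */

/* Hypothetical helper: count the lines in a buffer of `size` bytes.
   The buffer is treated as raw bytes, not as a NUL-terminated string. */
size_t count_lines(const char *data, size_t size)
{
    const char *p, *end;
    size_t lines = 0;

    if (size == 0) {
        return 0;
    }
    assert(data != NULL);

    p = data;
    end = data + size;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t) (end - p));
        lines++;
        if (nl == NULL) {
            break;              /* last line has no trailing New Line */
        }
        p = nl + 1;
    }

    return lines;
}
```

With the globals from the example, the call would be count_lines(Fdata, Fsize).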

Let’s demonstrate this method by writing a full program that uses open_file() to load a data file into memory, then print the contents of the array:

#include <stdio.h>
#include <stdlib.h>                    /* calloc, free */

#include <io.h>                        /* open, read, close, filelength */
#include <fcntl.h>                     /* O_RDONLY */

char *Fdata;
size_t Fsize;

size_t open_file(const char *filename)
{
  ...
}

int main()
{
    size_t i;

    if (open_file("data.dat") == 0) {
        puts("failed");
        return 1;
    }

    puts("file data:");
    for (i = 0; i < Fsize; i++) {
        printf("%c<%d>", Fdata[i], Fdata[i]);
    }
    puts("EOF");

    free(Fdata);

    return 0;
}

The program uses a loop to iterate through the data, and prints each byte as both a regular character and its ASCII value. I did this to demonstrate that the Carriage Return (ASCII 13) + New Line (ASCII 10) pair gets translated to just a single New Line.

In one example, the program might load a very short file like this:

K 4
K 1 ; K 2 ; K 3 ; K 4
201 + 202 + 203 + 204 / 101

For my sample data file, the program prints this output:

file data:
K<75> <32>4<52>
<10>K<75> <32>1<49> <32>;<59> <32>K<75> <32>2<50> <32>;<59> <32>K<75> <32>3<51> <32>;<59> <32>K<75> <32>4<52>
<10>2<50>0<48>1<49> <32>+<43> <32>2<50>0<48>2<50> <32>+<43> <32>2<50>0<48>3<51> <32>+<43> <32>2<50>0<48>4<52> <32>/<47> <32>1<49>0<48>1<49>
<10>EOF

The modern method

If you’re working on Linux or another modern Unix-like system, there’s another, more efficient method to load an entire data file into memory. The mmap() system call “maps” a file into memory with the memory protections you request, and a private mapping isolates any changes to the copy in memory so they never get saved back to the file. In my case, my program only needs to read the file, and possibly modify the copy that’s stored in memory. I don’t want to alter the file that’s saved on disk.

And you can use mmap() in exactly this way. At a high level, this requires basically the same steps as before, but without allocating memory and replacing the read() system call with mmap():

  1. open() to open the file
  2. fstat() to get the size of the file in bytes
  3. mmap() to map the file into memory

Let’s keep the sample program more or less the same, so it’s easy to compare the two methods. To load the contents of a file into an array, I might create a function called open_file() that reads a file into a global string array called Fdata of size Fsize:

char *Fdata;
size_t Fsize;

size_t open_file(const char *filename)
{
    int fd;
    struct stat inf;

    fd = open(filename, O_RDONLY);
    if (fd < 0) {
        puts("cannot read file");
        return 0;
    }

    if (fstat(fd, &inf) != 0) {
        puts("cannot stat file");
        close(fd);
        return 0;
    }

    Fsize = (size_t) inf.st_size;

    Fdata = mmap(NULL, Fsize, PROT_READ, MAP_PRIVATE, fd, 0);

    if (Fdata == MAP_FAILED) {
        puts("cannot mmap file");
        close(fd);
        return 0;
    }

    close(fd);
    return Fsize;
}

The mmap() system call is a bit tricky, so let’s look at the options. The general usage of mmap() looks like this:

void *mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset)

  1. addr is the address to use for the mapping. Set this to NULL to let the kernel choose an address for the mapping (this is recommended).
  2. len is the size of the region that should be mapped. I’ve used the size of the file, so it maps the full file.
  3. prot provides the memory protections to use, like PROT_READ for read-only. Add PROT_WRITE if the program will modify the copy in memory. See the mmap(2) manual page for other protections.
  4. flags indicates whether updates to the map should be visible to other processes, or if updates to the map should be saved back to the file. Using MAP_PRIVATE makes this a private copy-on-write mapping.
  5. fd is the file descriptor to read from.
  6. offset is the starting point. Use 0 for the start of the file.
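To see how prot and flags interact, here is a hedged sketch of a variant that asks for a writable private mapping. With PROT_READ | PROT_WRITE and MAP_PRIVATE, the program can change bytes in memory while the file on disk stays untouched. The function name map_private() is hypothetical and is not part of the program in this article.

```c
#include <assert.h>
#include <fcntl.h>      /* open, O_RDONLY */
#include <sys/mman.h>   /* mmap, MAP_FAILED */
#include <sys/stat.h>   /* fstat, struct stat */
#include <unistd.h>     /* close */

/* Hypothetical helper: map a file as a private, writable copy.
   Stores the mapping length in *size. Returns NULL on failure. */
char *map_private(const char *filename, size_t *size)
{
    struct stat inf;
    void *p;
    int fd;

    assert(filename != NULL && size != NULL);

    fd = open(filename, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }

    /* mmap() rejects a zero length, so an empty file is a failure here */
    if (fstat(fd, &inf) != 0 || inf.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* private copy-on-write mapping: writes stay in this process */
    p = mmap(NULL, (size_t) inf.st_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE, fd, 0);

    close(fd);                  /* the mapping survives the close */

    if (p == MAP_FAILED) {
        return NULL;
    }

    *size = (size_t) inf.st_size;
    return p;
}
```

Opening the file with O_RDONLY is enough here because a private mapping never writes back to the file. Release the mapping with munmap() and the stored size when done.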

You may have noticed the function closes the file after mapping it into memory. The mmap(2) manual page says that after the mmap() system call has returned, the file descriptor can be closed immediately without invalidating the mapping.

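Mapping a file into memory is more efficient, and often faster, because the kernel doesn’t copy the whole file into a separately allocated buffer; it still makes the full contents of a file available as a char array. After mapping the file, access the Fdata string as you would any array (read-only here, because the mapping uses PROT_READ). Just don’t forget to end the mapping with munmap() when the program is done working on the file.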

Let’s demonstrate this method by writing a full program that uses open_file() to load a data file into memory, then print the contents of the array:

#include <stdio.h>

#include <fcntl.h>                     /* open */
#include <unistd.h>                    /* close */

#include <sys/stat.h>                  /* stat */
#include <sys/mman.h>                  /* mmap */

char *Fdata;
size_t Fsize;

size_t open_file(const char *filename)
{
  ...
}

int main()
{
    if (open_file("data.dat") == 0) {
        puts("failed");
        return 1;
    }

    puts("File data:");
    for (size_t i = 0; i < Fsize; i++) {
        printf("%c<%d>", Fdata[i], Fdata[i]);
    }
    puts("EOF");

    munmap(Fdata, Fsize);
    return 0;
}

Processing the same data file on my Linux system, but using Unix line endings, generates this output:

File data:
K<75> <32>4<52>
<10>K<75> <32>1<49> <32>;<59> <32>K<75> <32>2<50> <32>;<59> <32>K<75> <32>3<51> <32>;<59> <32>K<75> <32>4<52>
<10>2<50>0<48>1<49> <32>+<43> <32>2<50>0<48>2<50> <32>+<43> <32>2<50>0<48>3<51> <32>+<43> <32>2<50>0<48>4<52> <32>/<47> <32>1<49>0<48>1<49>
<10>EOF

Things to know

There are some limitations on both of these implementations. For example, mmap is only available on Linux and other Unix-like systems; you cannot use mmap on DOS. Instead, DOS programs can only load files using the first method, by storing the file in memory. But DOS has limited memory, so many DOS programmers are either careful about how much data they need to store, or they load only the parts they need from the file.

The first method of reading a file into memory works on nearly any platform (with fstat() standing in for filelength() outside of DOS), but it is less efficient on systems like Linux because the entire file is copied into the buffer up front. With mmap, you’re actually mapping your file access to memory, so the operating system only loads the parts of the file that your program touches. If your program needs to read a file and store its contents in memory like an array, using mmap is probably the better option on Linux.
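If portability matters more than speed, one alternative is to stay within the standard C library and skip the size query entirely. The sketch below is a swapped-in technique, not either of the methods in this article: it reads with fread() and grows the buffer with realloc() as needed. The function name slurp() and the starting capacity of 1024 bytes are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdio.h>      /* fopen, fread, ferror, fclose */
#include <stdlib.h>     /* realloc, free */

/* Hypothetical helper: read a whole file using only standard C.
   Returns a NUL-terminated buffer and stores its length in *size.
   Returns NULL on failure. */
char *slurp(const char *filename, size_t *size)
{
    FILE *fp;
    char *buf = NULL;
    size_t cap = 0, len = 0;

    assert(filename != NULL && size != NULL);

    fp = fopen(filename, "rb");     /* binary mode: no \r\n translation */
    if (fp == NULL) {
        return NULL;
    }

    for (;;) {
        size_t n;

        /* grow when there is no room left for data plus the NUL */
        if (len + 1 >= cap) {
            size_t newcap = cap ? cap * 2 : 1024;
            char *tmp = realloc(buf, newcap);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
            cap = newcap;
        }

        n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (n == 0) {
            break;                  /* end of file or read error */
        }
    }

    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }

    fclose(fp);
    buf[len] = '\0';
    *size = len;
    return buf;
}
```

The tradeoff is some extra copying as the buffer grows, in exchange for code that needs neither filelength() nor fstat().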
