3 ways to read files in C
When you’re just starting out with a new programming language, it’s good to stick to the basics until you have a more solid understanding of how the language works. With that foundation, you can move on to more sophisticated algorithms and more interesting programs.
That’s why when I write articles about learning how to write programs, I tend to stick to the basics. In these “entry level” articles, I don’t want to lose my audience, so I stick to simple programming methods that are easy to understand – even if they aren’t the most efficient way to do it. For example, to demonstrate how to write your own version of the cat program, I might use stream functions like fgetc to read a single character from the input and fputc to print a single character on the output.
While reading and writing one character at a time isn’t a very fast way to print the contents of a text file, it’s simple enough that most new programmers can see what’s going on. Let’s look at three different ways that you could write a cat program, at three different levels: easy but slow, simple and fast, and most efficient.
Starting a ‘cat’ program
The cat command reads multiple files and concatenates them to the output, such as printing the contents to the user’s terminal. To implement the basics, we need a program called cat.c that processes all the files on the command line, opens them, and prints their contents. Additionally, if the user didn’t list any files, we can read from standard input and copy that to standard output.
#include <stdio.h>

void cpytext(FILE *in, FILE *out);

int
main(int argc, char **argv)
{
    FILE *in;

    for (int i = 1; i < argc; i++) {
        in = fopen(argv[i], "r");
        if (in) {
            cpytext(in, stdout);
            fclose(in);
        }
        else {
            fputs("cannot open file: ", stderr);
            fputs(argv[i], stderr);
            fputc('\n', stderr);
        }
    }

    if (argc == 1) {
        /* no input files, read from stdin */
        cpytext(stdin, stdout);
    }

    return 0;
}
This is a very simple program that uses a for loop to iterate over the command line arguments, stored in the argv array. The first item in the array (element 0) is the name of the program itself, so the loop actually starts at element 1 for the first command line argument. For each file, the program opens it and uses cpytext to print its contents to standard output.

We can write separate implementations of cpytext to create new versions of the cat program that use different methods to print the contents of text files.
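One optional refinement, my own suggestion rather than part of the original program, is to move the cpytext prototype into a small shared header so that cat.c and each implementation file agree on the same declaration. Here, cpytext.h is a hypothetical name chosen for illustration:

/* cpytext.h: a hypothetical shared header for every cpytext
   implementation; the article's cat.c declares the prototype
   directly instead */
#ifndef CPYTEXT_H
#define CPYTEXT_H

#include <stdio.h>

void cpytext(FILE *in, FILE *out);

#endif

Each source file would then include "cpytext.h" in place of its own declaration, and the compiler could catch any mismatch between the prototype and a definition.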
Easy but slow: One character at a time
The stream functions in stdio.h present a simple way to read and write data. We can use the fgetc function to read one character at a time from a file, and fputc to print one character at a time to a different file. Writing a cpytext function with these functions is just an exercise of reading with fgetc and writing with fputc until we reach the end of the file:
#include <stdio.h>

void
cpytext(FILE *in, FILE *out)
{
    /* copy one character at a time */
    int ch;

    while ((ch = fgetc(in)) != EOF) {
        fputc(ch, out);
    }
}
This method is easy to explain: the cpytext function takes two file pointers, one for the input and another for the output. cpytext then reads data from the input, one character at a time, and uses fputc to print it to the output. When fgetc encounters the end of the file, it stops.
If we save that file as cpy1.c, then we can compile a new cat program called cat1 like this:

$ gcc -o cat1 cat.c cpy1.c
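With that compiled, cat1 behaves like a bare-bones cat: it can print files back to the terminal, or copy standard input when no files are listed. For example, using the source files we already have as convenient test input:

$ ./cat1 cat.c cpy1.c
$ ./cat1 < cpy1.c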
Simple and fast: One line at a time
Reading and writing one character at a time is easy to explain, but the method is slow. Every time we call fgetc to read a single character, the C library has to do a little extra work, and that per-character overhead adds up. We can be somewhat more efficient by reading and writing more data at once, such as working with one line at a time.
The getline function from stdio.h will read an entire line into memory at once. This is similar to the fgets function, but with one important difference: where fgets reads data into a variable of a fixed size, getline can resize its buffer to fit the whole line into memory.
To use getline, you first need to allocate memory, assign it to a pointer, and set a variable to the size of that allocation. Or, don’t allocate memory at all: set the pointer to NULL (and the size to zero) and getline will allocate memory on its own.
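As a quick sketch of the first approach, which the cpytext version below does not use, you might pre-allocate a starting buffer yourself. The 64-byte starting size here is an arbitrary choice for illustration:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    /* sketch: pre-allocating a starting buffer for getline */
    char *line = malloc(64);     /* initial guess at a line length */
    size_t size = line ? 64 : 0; /* if malloc failed, let getline allocate */
    ssize_t len;

    while ((len = getline(&line, &size, stdin)) != -1) {
        fputs(line, stdout);     /* getline grows the buffer as needed */
    }

    free(line);                  /* free whatever buffer getline left us */
    return 0;
}

Either way, getline updates the pointer and the size variable whenever it has to grow the buffer, which is why both are passed by address.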
Using getline requires more memory than fgetc because it’s storing an entire line of text, but otherwise the basic algorithm is the same: read a line of text from the input, then print that line to the output.
#include <stdio.h>
#include <stdlib.h>

void
cpytext(FILE *in, FILE *out)
{
    char *line = NULL;
    size_t size = 0;
    ssize_t len;

    while ((len = getline(&line, &size, in)) != -1) {
        fputs(line, out);
    }

    free(line);
}
Note that getline is meant to read text data, not to copy arbitrary data between files. For example, fputs stops at the first null byte, so a “line” containing a null character would be cut short on output. But if the use case is to implement a cat program that prints the contents of text files, we should be okay.
If we save that file as cpyline.c, then we can compile a new cat program called catline like this:

$ gcc -o catline cat.c cpyline.c
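If you did want the line-at-a-time version to survive a null byte in the input, one small variation, my own rather than the article’s, is to write with fwrite using the byte count that getline returns, instead of calling fputs:

#include <stdio.h>
#include <stdlib.h>

/* sketch: a variation on cpyline.c that survives embedded null
   bytes; fwrite uses the byte count returned by getline, so a
   '\0' in the middle of a line no longer cuts the output short */
void
cpytext(FILE *in, FILE *out)
{
    char *line = NULL;
    size_t size = 0;
    ssize_t len;

    while ((len = getline(&line, &size, in)) != -1) {
        fwrite(line, 1, (size_t) len, out);
    }

    free(line);
}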
Most efficient: Read a block of data
One problem with using getline to print the contents of a text file is when the program encounters a large file that has exactly one line. Then the getline function must read the entire file into memory before it can print anything. That’s not a great way to use memory.
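To see the problem for yourself, you could build a pathological one-line test file; here I’m using tr to replace every newline in the words file with a space (the file name oneline is my own):

$ tr '\n' ' ' < /usr/share/dict/words > oneline

Printing oneline with the getline version forces the whole file into a single buffer before any output appears.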
Instead, we can use the fread function to read a block of data from a file at once, then use fwrite to write the same block to a different file. To do this, we need the feof function to tell us when we’ve reached the end of the file. Otherwise, the general algorithm is the same: read from the input, then write to the output.
#include <stdio.h>

#define BUFSIZE 128

void
cpytext(FILE *in, FILE *out)
{
    char buf[BUFSIZE];
    size_t numread;

    while (!feof(in)) {
        numread = fread(buf, sizeof(char), BUFSIZE, in);
        if (numread > 0) {
            fwrite(buf, sizeof(char), numread, out);
        }
    }
}
The fread function reads data into a buffer, called buf, which has a fixed size of 128 bytes. fread will read up to 128 characters from the input, store them in buf, then return a count of how many characters it actually read. We store that count in a variable called numread, which we use with fwrite to copy the contents of the buffer to the output.
If we save that file as cpybuf.c, then we can compile a new cat program called catbuf like this:

$ gcc -o catbuf cat.c cpybuf.c
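One refinement you might add beyond what cpybuf.c does is to distinguish the end of the file from a read error. This sketch, my variation rather than the article’s, tests fread’s return value directly in the loop condition, then asks ferror whether the loop stopped early because something went wrong:

#include <stdio.h>

#define BUFSIZE 128

/* sketch: like cpybuf.c, but checks fread's return value in the
   loop condition and reports read errors with ferror */
void
cpytext(FILE *in, FILE *out)
{
    char buf[BUFSIZE];
    size_t numread;

    while ((numread = fread(buf, sizeof(char), BUFSIZE, in)) > 0) {
        fwrite(buf, sizeof(char), numread, out);
    }

    if (ferror(in)) {
        fputs("read error\n", stderr);
    }
}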
How they compare
The basic algorithm remains the same across each implementation of cpytext, although the details change: read data from one file, and print it to another file. However, each version performs quite differently.
Let’s demonstrate how quickly each method can run by using cat to copy the contents of a large text file. The /usr/share/dict/words file contains a long list of words, which can be used by spell-checking programs. On my Fedora Linux system, this is a 4.8 MB file that contains almost a half million words:
$ wc -l /usr/share/dict/words
479826 /usr/share/dict/words
$ ls -H -sh /usr/share/dict/words
4.8M /usr/share/dict/words
The time command will run a program and then print how much time that program needed to execute, broken down by “real” time (from start to finish), “user” time (CPU time spent in the program’s own code) and “system” time (CPU time spent in the kernel on the program’s behalf). To time how long it takes to read the /usr/share/dict/words file with the /bin/cat command, and save the output to a temporary file called w, we can type this:
$ time /bin/cat /usr/share/dict/words > w
To verify that the file didn’t change as we copied it with cat, we can use the cmp program; cmp prints any differences between two files, and otherwise remains silent if they are the same. For example, to compare /usr/share/dict/words with the w file, type this:
$ cmp /usr/share/dict/words w
If cmp doesn’t print anything, we know the two files are the same.
To compare the run times of each implementation, we can write a script to run each version and report the times. I’ve added the /bin/cat program twice, at the start and at the end, because the operating system will cache the contents of a file in memory the first time we read it. We can then ignore the first /bin/cat time, and use the second time.
#!/bin/sh
words=/usr/share/dict/words
echo '/bin/cat..'
time /bin/cat $words > w
cmp $words w
echo 'cat1..'
time ./cat1 $words > w
cmp $words w
echo 'catline..'
time ./catline $words > w
cmp $words w
echo 'catbuf..'
time ./catbuf $words > w
cmp $words w
echo '/bin/cat..'
time /bin/cat $words > w
cmp $words w
If we save this script as runall and make it executable, we can run it to compare each cat implementation at once:
$ ./runall
/bin/cat..
real 0m0.007s
user 0m0.002s
sys 0m0.005s
cat1..
real 0m0.073s
user 0m0.047s
sys 0m0.016s
catline..
real 0m0.033s
user 0m0.015s
sys 0m0.008s
catbuf..
real 0m0.018s
user 0m0.004s
sys 0m0.006s
/bin/cat..
real 0m0.002s
user 0m0.000s
sys 0m0.002s
We can see that reading and writing one character at a time with fgetc and fputc (cat1) was the slowest method, requiring 73 milliseconds to copy the 4.8 MB text file. Reading a line at a time using getline (in catline) was noticeably faster, at 33 milliseconds. But reading and writing a block of data at a time using fread and fwrite (catbuf) was faster still, at only 18 milliseconds.
Our catbuf implementation read 128 characters at a time, which is good, but still quite small. The program can run faster with a larger buffer. The system /bin/cat program uses this method with a much larger buffer, and takes virtually no time at all: only 2 milliseconds to read 4.8 MB of text data.
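As a sketch, only the buffer size needs to change to try this; the 64 KB figure below is my own pick, not a number taken from the system cat:

#include <stdio.h>

/* sketch: the same block-copy loop as cpybuf.c, with a larger
   buffer; 64 KB is an arbitrary size chosen for illustration */
#define BUFSIZE 65536

void
cpytext(FILE *in, FILE *out)
{
    char buf[BUFSIZE];
    size_t numread;

    while (!feof(in)) {
        numread = fread(buf, sizeof(char), BUFSIZE, in);
        if (numread > 0) {
            fwrite(buf, sizeof(char), numread, out);
        }
    }
}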
Slowing it down
You might wonder: why bother, if the difference is so small? My quad-core Intel(R) Core(TM) i3-8100T CPU @ 3.10GHz is certainly very fast, but consider the performance impact on slower systems.
Let’s run the same test on a slower system. I have a virtual machine running FreeBSD, which I use for testing. FreeBSD is actually a fast operating system, but since it’s running in a virtual machine, I can slow it down by running the virtual machine without KVM acceleration.
The /usr/share/dict/words file is smaller on FreeBSD than on Linux, at just over 236,000 words. The file itself is 2.4 MB in size:
$ wc -l /usr/share/dict/words
236007 /usr/share/dict/words
$ ls -s -H /usr/share/dict/words
2496 /usr/share/dict/words
To make a more direct comparison between my fast Linux system running on real hardware and my FreeBSD instance running on an artificially slow virtual machine, I’ll double the size of the text file by copying its contents twice to a new file called words in my working directory. The new file approaches half a million words, and is about 4.8 MB in size; both measurements are about the same as on Linux:
$ cat /usr/share/dict/words /usr/share/dict/words > words
$ wc -l words
472014 words
$ ls -lh words
-rw-r--r-- 1 jhall jhall 4.8M May 15 14:37 words
I’ve compiled the same source files on FreeBSD, and this is my output when running the virtual machine without using KVM:
$ ./runall
/bin/cat..
0.04 real 0.00 user 0.04 sys
cat1..
2.85 real 2.71 user 0.11 sys
catline..
0.67 real 0.59 user 0.07 sys
catbuf..
0.15 real 0.08 user 0.05 sys
/bin/cat..
0.03 real 0.00 user 0.03 sys
Running FreeBSD without KVM simulates a much slower system, where we can see a more dramatic difference between these programs. Reading and writing one character at a time (cat1) is quite slow, requiring 2.85 seconds to copy the 4.8 MB text file. But reading one line at a time with getline (as catline) is much better, at about 670 milliseconds of real time. Reading and writing 128 characters at a time (catbuf) is faster still, at only 150 milliseconds to copy the 4.8 MB text file. The system /bin/cat program uses the same method but with a larger buffer, so it only needs 30 milliseconds to print the text file.
Teaching using simple methods
When I write “introductory” articles about how to get started in programming, I try to write my sample programs in a way that everyone can see what’s going on. But learning how to program for the first time is challenging enough without adding complicated algorithms on top. My approach to “writing your first program” is: learn the basics first, then move on to more advanced methods. So I might use fgetc and fputc to demonstrate how to write your own version of cat on Linux or TYPE on FreeDOS, even though there’s a better, faster way to do it.