Motivation
Prior to the C++ string class (i.e., in C), strings were represented in a relatively low-level, but quite natural manner: as arrays of ASCII characters, declared as char [] and commonly accessed through a char *.
- Until the introduction of the C++ string class, the term string referred exclusively to such arrays; with the advent of that class, the term C-style string was introduced to distinguish the two.
- As these arrays of characters were intended to represent strings of text, the expectation was that only the printable ASCII characters (essentially 95 characters, from ASCII 0x20 (decimal 32, ' ') through ASCII 0x7E (decimal 126, '~')) would appear in such strings.
- Because strings were expected to contain only printable ASCII characters, C-style strings could use any of the non-printable characters as a trailer value (since it would never appear as an actual data element in the array).
- In particular, ASCII 0x00 ('\0', integer value 0) was chosen; because it also represents false in C/C++, it leads to some 'elegant' and highly terse string processing idioms (we'll see examples of this below).
Overview
A C string is:
- a sequence of characters (char) terminated with a null byte ('\0')
- declared as
String Literals
- Sequence of characters enclosed in double quotes (")
- Compiler allocates space corresponding to the characters in the literal plus a null byte
- Thus "Hello" is equivalent to {'H', 'e', 'l', 'l', 'o', '\0'}
- type is const char[n], where n is the number of characters in the literal + 1 for the null byte terminator
- Compiler 'returns' a pointer to the anonymous string's allocated space
s = "Hello";
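The storage rule for literals can be checked directly. The following is a minimal sketch; the function names literal_storage, literal_length, and check_literal_layout are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bytes the compiler reserves for the literal: five characters
// plus the terminating null byte.
std::size_t literal_storage() { return sizeof("Hello"); }

// Characters before the null byte (the null byte is not counted).
std::size_t literal_length() { return std::strlen("Hello"); }

// The literal and the spelled-out character list describe the same bytes.
void check_literal_layout() {
    const char spelled_out[] = {'H', 'e', 'l', 'l', 'o', '\0'};
    assert(sizeof(spelled_out) == sizeof("Hello"));
    assert(std::memcmp(spelled_out, "Hello", sizeof(spelled_out)) == 0);
}
```

Note the difference between the two numbers: sizeof reports the storage (including the null byte), while strlen reports the logical length.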
char [] vs char *
Recall that although an array is passed to a function as a pointer, there is a difference between declaring an array and a pointer:
- the array declaration allocates space for the specified elements, and the name acts as a constant pointer to the first element (i.e., it cannot be reassigned to point elsewhere)
- the pointer declaration allocates a pointer variable (typically four or eight bytes) that must be assigned an address, but can be reassigned to another address
char []
- Array-storage is allocated for the specified number of elements (explicitly or implicitly)
- size of array (string) is thus fixed at compile-time
- as with any array, the array name is a constant
- various ways of initializing the array
- uninitialized — an empty buffer waiting to be filled
char name[80];
- initialized as an array
char s[] = {'H', 'e', 'l', 'l', 'o', '\0'};
- emphasizes s as an array of characters
- not a very common way of initializing the string (clumsy)
- as usual, compiler infers size of array from initializer
- no additional room for growth (e.g. concatenation or copy of a larger string)
- initialized as a string
char s[] = "Hello";
- emphasizes s as a string
- compiler adds null byte (so the array size is still 6)
- same size issues as above
- initialized with room to grow
char s[80] = "Hello";
- compiler adds the null byte after the five characters; the remaining elements are also set to '\0'
- capacity is now 80 (for a max of a 79 character string)
char *
- Pointer-storage is allocated for a pointer variable only
- The pointer can then be assigned the addresses of various strings
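The contrast between the two declarations might be sketched as follows. The function names array_capacity, array_length, and retarget are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// For an array, sizeof reports the whole buffer, not the text in it.
std::size_t array_capacity() {
    char s[80] = "Hello";
    return sizeof(s);
}

// strlen reports only the characters before the null byte.
std::size_t array_length() {
    char s[80] = "Hello";
    return std::strlen(s);
}

// A pointer is a variable, so it can be aimed at a different string.
const char *retarget() {
    const char *p = "Hello";
    assert(std::strcmp(p, "Hello") == 0);
    p = "Goodbye";          // legal: only the pointer changes
    // char a[] = "Hello";
    // a = "Goodbye";       // would not compile: an array name cannot be reassigned
    return p;               // string literals live for the whole program
}
```

The pointer version is declared const char * because it points at string literals, which must not be modified.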
Working With C Strings
There are several issues related to working with C strings, especially for Java and C++ programmers who are accustomed to using a self-managed string class.
- Due to their having a null terminating byte (i.e., a trailer value within the array), C strings are processed differently than most other arrays (where a separate length/size value determines the number of elements in the array).
- In particular, one typically processes a C string using a while loop, since termination of the iteration is controlled by a condition (s[i] == '\0') rather than a length (which is the case for most arrays, and why most arrays are typically processed using a for loop).
- this is a parallel to the classical trailer/header techniques for reading input:
- C-style strings correspond to an input sequence with a distinguished (i.e., special) trailing value at the end of the data,
- other arrays correspond to an input sequence prefixed by a header value; the only (minor) difference is that in the input sequence the header value is part of the data stream (typically coming immediately before the data sequence), whereas in the array, the header value is typically maintained in a separate variable and passed around as a separate parameter (or data member for a container class).
- Since C strings are essentially built-in, primitive arrays, they have a fixed length; it may be a dynamically calculated length, and the array allocated at run time, but the length is still fixed. (That's the whole point of the string class: it encapsulates the built-in array within the class and manages the allocation of sufficient space, using a method such as the checkCapacity function presented in class.)
- This means the programmer employing C strings must always be aware of the capacity (i.e., physical size / maximum number of characters) of the array (also sometimes called the buffer).
- An example of the issue: when concatenating one string, s2, to the end of another, s1, one must make sure that the array containing s1 has sufficient capacity to hold s2 after s1. If not, the characters of s2 will overflow s1's buffer. And remember, there's no ArrayIndexOutOfBoundsException here!
- Finally, the programmer constructing or otherwise manipulating a C string must often remember to ensure a null terminator byte is present at the logical end of the buffer; i.e., after the last significant character in the array.
- The C++ << operator is overloaded to 'recognize' C-strings and print them out as strings (rather than as an array of characters).
In summary, there is a lot of 'stuff' to deal with, which is one of the major reasons a string class is so desirable.
To help address and minimize the consequences of not taking these issues into account, C (and thus C++) provides a library of C string functions to aid in the processing of C strings. These are accessed via the cstring header file.
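To make the capacity concern concrete, here is one possible sketch of a concatenation that refuses to overflow. The helper append_if_fits is hypothetical (it is not a cstring function), and it assumes dest already holds a null-terminated string inside a buffer of capacity bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: appends src to dest only when the result,
// including its null byte, fits in a buffer of `capacity` bytes.
// Returns false and leaves dest unchanged when it would not fit.
bool append_if_fits(char *dest, std::size_t capacity, const char *src) {
    assert(dest != nullptr && src != nullptr);
    std::size_t used = std::strlen(dest);    // characters already in dest
    std::size_t extra = std::strlen(src);    // characters to be added
    if (used + extra + 1 > capacity)         // + 1 for the null byte
        return false;
    std::memcpy(dest + used, src, extra + 1); // copies src's null byte too
    return true;
}
```

The caller has to pass the capacity separately (for example, sizeof buf for a local array) because nothing in the array itself records it.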
Examining the Functions in the cstring C String Library
- Here's an online reference to the C Standard library.
- Basic string-manipulation functions
size_t strlen(const char *str);
Returns the length (number of characters up to but not including the null byte) of the string starting at the character pointed to by str
char *strcpy(char *to, const char *from);
Copies the string (characters up to and including the null byte) from from to to; returns a pointer to the destination string (i.e., to)
char *strcat(char *str1, const char *str2);
Concatenates (appends) the characters of str2 to the end of str1; returns str1
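int strcmp( const char *str1, const char *str2 );
Compares str1 and str2 character-by-character. Returns:
a negative value if str1 < str2
0 if str1 == str2
a positive value if str1 > str2
(The standard guarantees only the sign of the result, not the specific values -1 and 1.)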
char *strchr( const char *str, int ch );
Returns a pointer to the first location in str that contains ch. If ch does not appear in the string, NULL is returned.
- pointer subtraction (-) is often used to get a relative position within the string
char *strrchr( const char *str, int ch );
Returns a pointer to the last location in str that contains ch. If ch does not appear in the string, NULL is returned.
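Two of the points above lend themselves to a short illustration: testing only the sign of strcmp's result, and turning the pointer returned by strchr or strrchr into a position by subtraction. The wrapper names compare_sign, index_of, and last_index_of are invented for this sketch.

```cpp
#include <cassert>
#include <cstring>

// Reduces strcmp's result to -1, 0, or 1 by looking only at its sign.
int compare_sign(const char *a, const char *b) {
    int r = std::strcmp(a, b);
    return (r > 0) - (r < 0);
}

// Position of the first occurrence of ch in s, or -1 if absent.
int index_of(const char *s, char ch) {
    assert(s != nullptr);
    const char *hit = std::strchr(s, ch);
    return hit ? static_cast<int>(hit - s) : -1;  // pointer subtraction
}

// Position of the last occurrence of ch in s, or -1 if absent.
int last_index_of(const char *s, char ch) {
    assert(s != nullptr);
    const char *hit = std::strrchr(s, ch);
    return hit ? static_cast<int>(hit - s) : -1;
}
```

Code that compares strcmp's result against exactly 1 or -1 may work on one platform and fail on another, which is why the sign test is the safer habit.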
- I/O-related functions
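char *gets( char *str );
Reads characters from stdin into str until newline or EOF, terminating the string with a null byte. A pointer to str is returned, or NULL if an error occurred (including reading past EOF). Because gets cannot limit how much it reads, it has been removed from the language standards (C11 and C++14); use fgets instead.
int puts( const char *str );
Outputs the characters in str, followed by a newline, to stdout. Returns a non-negative value on success, EOF on error.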
- Functions that Prevent Buffer Overflows
char *strncpy( char *to, const char *from, size_t count );
Works like strcpy except at most count characters are copied. If from contains fewer than count characters, to is padded with '\0' (null bytes). If from contains count or more characters, no null byte is written, so to must be terminated explicitly.
char *strncat( char *to, const char *from, size_t count );
Works like strcat except at most count characters are appended, followed by a null byte.
char *fgets( char *str, int num, FILE *stream );
Works like gets except:
- Input comes from the FILE * parameter rather than stdin
- Characters are read in until either
- A newline is encountered, in which case it is placed into the string
- EOF is encountered
- num - 1 characters have been read in
A null byte is always added to the end of the string. str is returned, or NULL on error (or if EOF is reached before any characters are read).
int fputs( const char *str, FILE *stream );
Works like puts except the output is sent to the FILE * parameter and no newline is appended
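One subtlety of strncpy deserves a sketch: when the source is too long, the destination is left without a null byte unless the caller adds one. The helper bounded_copy below is hypothetical and shows one common way of handling this, assuming capacity is the full size of the destination buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies at most capacity - 1 characters of src
// into dest and always terminates dest with a null byte.
// Returns true when all of src fit, false when it was truncated.
bool bounded_copy(char *dest, std::size_t capacity, const char *src) {
    assert(dest != nullptr && src != nullptr && capacity > 0);
    std::strncpy(dest, src, capacity - 1); // may stop before writing '\0'
    dest[capacity - 1] = '\0';             // so terminate explicitly
    return std::strlen(src) < capacity;
}
```

Reserving the last byte for the terminator is the design choice here; the return value lets the caller detect truncation rather than silently losing characters.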
Understanding the String Idioms
There are several distinctive C-originated idioms that arise when working with null-terminated C strings. These are based on the following C'isms:
- The null terminator, being 0 ('\0'), is also false
- An array may be iterated using a pointer just as easily as using a subscript
- C permits characters to be assigned within the condition of a loop, and then also be used as the value of the condition
Checking for a null byte
Again, processing C-strings is like performing trailer-value-based input: one uses a while (a conditional loop) with the null byte ('\0') as the terminating condition (trailer value). The most straightforward way to check for the null byte (ASCII 0, '\0') is to write:
*s == '\0'
- since '\0' is also the false value, this is the same as writing the 'pure' boolean expression:
!*s
i.e., the value pointed to by s is not true (i.e., it's false, or '\0')
Similarly, the condition for a character NOT being the null byte is:
*s != '\0'
- and by the same reasoning as above:
*s
i.e., the value pointed to by s is true (i.e., it is NOT false, and so not '\0')
Iterating Through a C-String
Iterating through s is accomplished with the following pattern:
while (*s) { // while not yet at the null byte
… // process the current character (*s)
s++; // go to next character
}
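As one concrete instance of the pattern, the sketch below counts how often a character appears in a string. The function name count_char is made up for this example.

```cpp
#include <cassert>

// Counts occurrences of target in s using the while (*s) idiom.
int count_char(const char *s, char target) {
    assert(s != nullptr);
    int count = 0;
    while (*s) {              // stop at the null byte
        if (*s == target)     // process the current character
            count++;
        s++;                  // go to next character
    }
    return count;
}
```

Since s is a copy of the caller's pointer, advancing it inside the function does not disturb the caller's view of the string.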
Recursing Through a C-String
Similarly, recursing through s uses the following pattern:
void f(char *s) {
if (!*s) return; // null byte is escape clause
… // process current character (*s)
f(s+1); // recurse to next character
}
Copying a sequence of successive characters
Here is the straightforward way of moving through s2 and copying each character to the corresponding position pointed at by s1. After each character is copied, the pointers are incremented to the next position, the sequence terminating when the null byte is encountered in s2:
while (*s2) {
*s1 = *s2;
s1++;
s2++;
}
*s1 = '\0'; // it ain't a C-string without the null byte
- The null byte is NOT copied in the loop: the loop is not entered when *s2 is the null byte, and thus the copy of that byte is never done
- The destination C-string (which was built using s1) is therefore completed after the loop by terminating it with a null byte
The above can be rewritten in the highly terse, yet elegant code:
while (*s1++ = *s2++) // assigns the character pointed to by s2 to the location pointed to by s1, and bumps up both pointers; stops on '\0'
;
- by merging the increment and the indirection, the assignment and move-to-next-position are consolidated for s1 and s2
- note the operator in the condition is an assignment, NOT an equality … this is where the assignment of s2 characters to s1 occurs.
- the result of the assignment (the character just copied) becomes the value of the condition, and thus the process is terminated when the null byte is reached, but after the assignment is performed (i.e., AFTER the null byte has been copied to s1)
One common issue when copying is maintaining a pointer to the beginning of the string; note how s1 and s2 both move down the string; if the beginning of the string is important in the current context, their locations must be saved. This will be seen in the examples below.
Implementing the String Functions Using the Various Idioms
strlen
int strlen_1(char *s) {
int count = 0;
while (*s != '\0') {
count++;
s++;
}
return count;
}
Notes
- Maintaining a count variable
- Tests for null byte (test is explicit)
- Uses pointer arithmetic
int strlen_2(char s[]) {
int i = 0;
while (s[i] != '\0')
i++;
return i;
}
Notes
- Uses subscripting
- Index and counter are same variable
int strlen_3(char *s) {
int i;
for (i = 0; s[i] != '\0'; i++)
;
return i;
}
Notes
- Note how the for loop is being used in a similar manner as a while
int strlen_4(char *s) {
char *p = s;
while (*p++)
;
return p - s - 1;
}
Notes
- Pure pointer arithmetic this time
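- No counter variable is used; pointer difference (subtraction) is used instead
- Note how auto-increment, ++, is used in the same expression as *
- Note how the test for the null byte is implicit
- The - 1 in the return compensates for p having been incremented one position past the null byte when the loop ends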
int strlen_5(char *s) {
char *p;
for (p = s; *p; p++)
;
return p - s;
}
Notes
- Another example of the for loop used similarly to a while
int strlen_rec1(char *s) {
if (!*s) return 0;
return strlen_rec1(s+1) + 1;
}
Notes
- Recursive
- Length defined as
- length of empty string (!*s) is 0
- length of any other string is 1 + the length of the string after the first character (i.e., from the 2nd character on)
- Notice the absence of a counter local variable
int strlen_rec2(char *s) {
return *s ? strlen_rec2(s+1) + 1 : 0;
}
Notes
- Just another application of the conditional (?:) operator
strcpy
char *strcpy_1(char *to, const char *from) {
char *originalTo = to;
while (*from) {
*to = *from;
to++;
from++;
}
*to = '\0';
return originalTo;
}
Notes
- Pointer arithmetic again
- Note the terminating assignment of a null byte
- Note the need to save the original value of to for the subsequent return of a pointer to the beginning of the destination string
char *strcpy_2(char *to, const char *from) {
char *originalTo = to;
while (*to++ = *from++)
;
return originalTo;
}
Notes
- Again, the combination of autoincrement and pointer dereferencing (*)
- Note no explicit null byte assignment here
char *strcpy_3(char *to, const char *from) {
int i = 0;
while (from[i]) {
to[i] = from[i];
i++;
}
to[i] = '\0';
return to;
}
Notes
- Subscripting
- Note no need for temporary hold pointer, however, a local index is used
char *strcpy_rec(char *to, const char *from) {
*to = *from;
if (*from) strcpy_rec(to+1, from+1);
return to;
}
Notes
- Recursive
- To copy:
- Copy first character
- If not at end (i.e., null byte), (recursively) copy the rest of the string
- Notice absence of local variables to hold original pointer or index
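The same idioms carry over to the other library functions. As a hedged sketch (not the library's actual implementation), strcat and strcmp might be written as follows; the names strcat_sketch and strcmp_sketch are invented here to avoid clashing with the real functions.

```cpp
#include <cassert>

// Appends from to the end of to; the caller must guarantee that
// to's buffer has room for the combined string plus the null byte.
char *strcat_sketch(char *to, const char *from) {
    assert(to != nullptr && from != nullptr);
    char *originalTo = to;       // saved for the return value
    while (*to)                  // advance to to's null byte
        to++;
    while ((*to++ = *from++))    // copy, including from's null byte
        ;
    return originalTo;
}

// Returns a negative, zero, or positive value, like strcmp.
int strcmp_sketch(const char *s1, const char *s2) {
    while (*s1 && *s1 == *s2) {  // advance while characters match
        s1++;
        s2++;
    }
    // Compare as unsigned char so the sign is well defined.
    return static_cast<unsigned char>(*s1) - static_cast<unsigned char>(*s2);
}
```

Notes
- strcat_sketch is the strlen loop followed by the terse strcpy loop
- strcmp_sketch stops at the first mismatch or at the null byte of s1, whichever comes first; if s2 ends first, the characters differ and the loop stops there too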
Command Line Arguments