ztrcpy() and ztrins(): A few important extensions for <string.h>
Adolfo Di Mare
Some string manipulation functions provided in the C language standard
library are not safe. In this article, functions ztrcpy(), ztrins(), and
others are presented as an alternative that avoids some of the inherent
problems in the corresponding standard library functions. This
implementation should work in most C language environments as well as in
C++.
In his paper “Managed String Library for C”, Robert Seacord [1] describes
several common string manipulation errors [2] and many approaches that
people have used to work around the shortcomings of functions like
strcpy(), which allow unbounded string copies. The companion standard
function strncpy() takes an extra “size” argument, but it does not place
the end of string mark in every case: when the source has “size” or more
characters, the destination is left unterminated. Hence, it makes sense
to use a version of strcpy() that will always leave the result string
zero terminated. I call this function ztrcpy(), where the leading “Z” is
a reminder that this function works in the same manner as strcpy() but
its “size” parameter prevents unbounded memory overruns. Function
ztrcpy() has the following signature:

    char * ztrcpy(
        size_t       size,  /* sizeof(dest) */
        char *       dest,
        const char * src
    );
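The source for ztrcpy() lives in ztring.c and is not reproduced in this article. As a minimal sketch of the idea, assuming only the signature above, a bounded copy that always terminates could look like the following. The name ztrcpy_sketch() and its handling of NULL pointers and zero sizes are assumptions made for illustration; this is not the author's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ztrcpy(): copies at most size-1 characters
   from 'src' into 'dest' and always stores the end of string mark. */
char *ztrcpy_sketch( size_t size, char *dest, const char *src ) {
    size_t len;
    if ( dest==NULL || size==0 ) { return dest; }   /* nothing can be stored */
    if ( src==NULL ) { dest[0] = 0; return dest; }  /* assumption: NULL -> "" */
    len = strlen( src );
    if ( len > size-1 ) { len = size-1; }           /* truncate to fit */
    memmove( dest, src, len );                      /* tolerates overlap */
    dest[len] = 0;                                  /* always terminated */
    return dest;
}

/* Small self check in the style of the article's unit tests */
void ztrcpy_sketch_demo( void ) {
    char small[4];
    ztrcpy_sketch( sizeof(small),small, "abcdef" );
    assert( strcmp(small,"abc")==0 );     /* truncated, but terminated */
    ztrcpy_sketch( sizeof(small),small, "ab" );
    assert( strcmp(small,"ab")==0 );      /* short strings fit untouched */
}
```

Unlike strncpy(), this sketch never leaves 'small' without its terminator, at the cost of silently dropping the characters that do not fit.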
Function strlcpy() (with an “L”) is an alternative to strcpy() because it
always zero terminates strings. However, it returns as a size_t number
the length of the source string, a value used to determine whether string
truncation occurred when copying. This is confusing if one wants to
“avoid unbounded copies” without having to deal, after the fact, with
“the size the string should have”. Also, I do not like that the “size”
parameter for strlcpy() is at the end, because it seems more natural to
me to put it right next to the destination string parameter. Most people
will not read the fine print in a prescription; in my opinion, strlcpy()
has some quirks that are tough to grasp.
There is another replacement for strcpy(), called strcpy_s(), but it
returns an error code using the opaque errno_t type. Again: why force the
programmer to check codes after each invocation? In my experience, what
is important is to avoid memory overwrites (truncated strings due to lack
of memory are easier to spot because they show up with fewer characters).
A while ago I used a C library that implemented function strins() to
insert a string inside another. I used it in a few programs, but there
were times when it would produce unbounded string copies (my fault: I
used small strings!). At first, it was hard to spot the error because the
program's behavior would be very strange (sometimes it would cycle back
to the beginning of the routine, probably because of run time stack
corruption). I have learned to trust the compiler more than myself, but
those memory overruns were not fun to deal with. I wanted to “fix”
strins(), but never took the time to do it until I came across Seacord's
article in Dr. Dobb's Journal. This is why I implemented ztrins() as a
memory safe version of strins(). Function ztrins() has the following
signature:

    char * ztrins(
        size_t       size,    /* sizeof(dest) */
        char *       dest,
        size_t       n,       /* insertion point */
        const char * insert   /* insertion string */
    );
I use C++ as my development tool. Since the functions I propose belong to
the C language, I decided to write the whole implementation in C. At
first I planned on implementing many functions, but after a little
thinking I decided to write as few as possible. In the end, I implemented
the “size” versions of strcpy() and strcat() [ztrcpy(), ztrcat()], three
“size checked” functions to manipulate strings [ztrins(), strdel(),
ztrsub()], another three functions to remove leading and trailing
characters [strltrim(), strrtrim(), strtrim()], one function to remove
characters from a memory block [memczap()], a couple of functions to
figure out the prefix and suffix in a string [strpfx(), strsffx()], and a
couple of functions to remove the accent from the Latin 1 accented
letters [strxltn1(), strxacct()]. The following example illustrates the
usage of these functions (eqstr(a,b) compares two C strings):
    { /* test::ztrins() */
        char s30[30];            /* 123456789.123456789.1 -> 21 chars */
        ztrcpy( sizeof(s30),s30,   "====!-----+.........+" );
        ztrins( sizeof(s30),s30, 4, "_2_4_" );    /* [4] <-> s30+(4) */
        assertTrue( eqstr(s30, "====_2_4_!-----+.........+") );
        assertTrue( 26 == strlen("====_2_4_!-----+.........+") );
        assertTrue( 21+strlen("_2_4_") == strlen(s30) );

        {   /* replace JIM with ROMEO */
            char *p;
            char poem[] = "JIM, JIM, JIM ... Where are you?";
            while ( 0!=(p=strstr(poem,"JIM")) ) {
                strdel( p, strlen("JIM") );
                ztrins( sizeof(poem),poem, p-poem, "ROMEO" );
            }
            assertTrue( eqstr(poem,"ROMEO, ROMEO, ROMEO ... Where ar") );
            assertTrue( strlen("JIM")<strlen("ROMEO") ); /* -> truncation */
            assertTrue( strlen("ROMEO, ROMEO, ROMEO ... Where ar") ==
                        strlen("JIM, JIM, JIM ... Where are you?") );
        }

        ztrcpy( sizeof(s30),s30, "====!-----+.........+" ); /* -> 21 chars */
        ztrins( sizeof(s30),s30, 00, "________18________" );   /* [0] */
        assertTrue( eqstr(s30, "________18________====!-----+") );
        assertTrue( strlen(s30) == sizeof(s30)-1 );       /* max size */

        ztrcpy( sizeof(s30),s30, "0123456789" );
        ztrins( /*size->*/1,s30, 0, "" );
        assertTrue( eqstr(s30, "") );   /* (size==1) ==> (s30[0]==0) */
        assertTrue( eqstr(s30+1 , "123456789" ) );
    }
String 's30' can hold up to 29 characters. First, it is initialized using
the memory safe ztrcpy() function. Then, ztrins() inserts "_2_4_" at
position [4] and, as the result fits within the size of 's30', there is
no truncation. Later, the word "JIM" in string 'poem' is replaced with
"ROMEO", but as the size of 'poem' is fixed at compile time, when the
longer word "ROMEO" is put in place the last letters in 'poem' need to be
truncated. The difference in length from "JIM" to "ROMEO" is 2 letters,
and as three instances of "JIM" get substituted, the last 2×3 letters are
left out (the trailing substring "e you?" has 6 characters).
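The replacement loop above relies on strdel(), whose source is not shown in the article. The following is a minimal sketch of what such a function could do, assuming it removes the first 'len' characters of the string it receives; the name strdel_sketch() and the clamping of 'len' are assumptions. The demo also checks the arithmetic of the explanation: each replacement removes 3 letters and needs room for 5.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strdel(): removes the first 'len' characters
   of 'dest' by shifting the tail left. The string never grows, which is
   why no 'size' argument is needed. */
char *strdel_sketch( char *dest, size_t len ) {
    size_t destlen;
    if ( dest==NULL ) { return dest; }
    destlen = strlen( dest );
    if ( len > destlen ) { len = destlen; }      /* assumption: clamp */
    memmove( dest, dest+len, destlen-len+1 );    /* +1 moves the zero too */
    return dest;
}

void strdel_sketch_demo( void ) {
    char poem[] = "JIM, JIM, JIM ... Where are you?";
    char *p = strstr( poem, "JIM" );
    strdel_sketch( p, strlen("JIM") );
    assert( strcmp(poem, ", JIM, JIM ... Where are you?")==0 );
    /* 29 letters remain and 5 must be inserted: 2 more than the 32 that fit */
    assert( strlen(poem)+strlen("ROMEO") == (sizeof(poem)-1) + 2 );
}
```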
Maybe a novice programmer would have a little difficulty figuring out why
ROMEO's poem got truncated, but at least no memory overwrite would
happen. If the versions of these functions that do not check sizes were
used, it would be hard to find the bug whenever the string was not big
enough. It is true that the implementations of ztrcpy() and ztrins()
require more time to check parameters and boundaries, but in most
applications the speed difference can only be measured in millionths of a
second, which is negligible. Also, there is no standard strins() in the
string.h header file. This is a summary of the implemented string
routines:
    /* ztring.h (C) 2014 adolfo@di-mare.com */

    /* 'size' checked versions of 'strcpy()' && 'strcat()' */
    char*  ztrcpy( size_t size, char * dest, const char * src );
    char*  ztrcat( size_t size, char * dest, const char * src );

    /* insert, delete and substring (with 'size' check) */
    char*  ztrins( size_t size, char * dest, size_t n, const char *insert );
    char*  strdel(              char * dest, size_t len );
    char*  ztrsub( size_t size, char * dest, const char *src, size_t len );

    /* trim 'str' left, right and both */
    char*  strltrim( const char * str , char tr );
    char*  strrtrim(       char * str , char tr );
    char*  strtrim(        char * str , char tr );

    /* remove character 'ch' from memory block 'mem' */
    size_t memczap( size_t size,void *mem, int ch );

    /* string prefix and suffix (boolean) */
    int    strpfx(  const char *str, const char *prefix );
    int    strsffx( const char *str, const char *suffix );

    /* Get span until character in character range '[a..z]' */
    size_t strrspn( const char * str, char a, char z );

    /* transform to ASCII accented letters in Latin 1 alphabet */
    char   strxltn1( char accented_latin_1 );
    char*  strxacct( char* str );
As the size parameter in these functions precedes the destination string,
it is easy to tie the two together. For example, the following macro ZS()
can be used to make strcpy() code interchangeable with ztrcpy():

    #ifdef USE_ZTR
        #define ZS(x) sizeof(x),x /** Shortcut macro */
    #else
        #define ZS(x) x
        /* convert ztrcpy(ZS(dest),src) -> strcpy(dest,src) */
        #define ztrcpy( dest, src) strcpy(dest,src)
        /* convert ztrcat(ZS(dest),src) -> strcat(dest,src) */
        #define ztrcat( dest, src) strcat(dest,src)
    #endif
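To show how a call site reads with ZS(), here is a small self-contained sketch. The ztrcpy() defined inside it is a hypothetical stand-in written only so the example compiles without ztring.c; the demo function name is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define USE_ZTR   /* remove this line to fall back to plain strcpy() */

#ifdef USE_ZTR
    #define ZS(x) sizeof(x),x
    /* Hypothetical stand-in for the library's ztrcpy() */
    static char *ztrcpy( size_t size, char *dest, const char *src ) {
        size_t len;
        if ( dest==NULL || size==0 ) { return dest; }
        len = strlen( src );
        if ( len > size-1 ) { len = size-1; }
        memmove( dest, src, len );
        dest[len] = 0;
        return dest;
    }
#else
    #define ZS(x) x
    #define ztrcpy( dest, src) strcpy(dest,src)
#endif

void zs_demo( void ) {
    char name[8];
    ztrcpy( ZS(name), "Romeo" );  /* -> ztrcpy( sizeof(name),name, "Romeo" ) */
    assert( strcmp(name,"Romeo")==0 );
#ifdef USE_ZTR
    ztrcpy( ZS(name), "Montague, Romeo" );   /* too long for name[8] */
    assert( strcmp(name,"Montagu")==0 );     /* 7 letters + terminator */
#endif
}
```

One caveat worth keeping in mind: ZS(x) only yields the right size when 'x' is a true array. If 'x' is a pointer, sizeof(x) is the size of the pointer, not of the buffer it points to.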
Many have argued that using macros is a bad idea [3], but in this case a
simple macro like ZS() helps the transition from the unsafe version of
each function to the safer one.
It is easy to find on the net some implementations of functions that are
similar to ztrcpy() and ztrcat():
    http://google.com/search?as_qdr=all&num=100&as_q=strlcpy+code
    http://google.com/search?as_qdr=all&num=100&as_q=strlcat+code
However, implementations to insert a string into another are not as plentiful:
    http://google.com/search?as_qdr=all&num=100&as_q=c+string+insert+code
    http://search.yahoo.com/search?n=100&p=c+string+insert+code
    http://www.bing.com/search?q=c+string+insert+code
The implementation of ztrins() requires a little bit of care because it
is easy to fall into an unbounded string copy. Moreover, special care is
needed to handle many boundary cases. This is a “size checked” C language
implementation to insert one string into another:

    char* ztrins( size_t size, char * dest, size_t n, const char * insert ) {
        if ( dest==NULL || size==0 ) {
            return dest;
        }
        else if ( size==1 ) {
            *dest=0;
            return dest;
        }
        else { /* ( size>=2 ) */
            size_t inslen, destlen = strlen( dest );
            --size; /* max length for 'dest' */
            if ( destlen>size ) { destlen = size; }
            if ( n>size || n>destlen || insert[0]==0 ) {
                dest[size] = 0;
                return dest;
            }
            inslen = strlen( insert );
            if ( size <= n+inslen ) { /* the whole 'insert' does not fit */
                memmove( &dest[n] , insert, (size-n) );
            }
            else { /* first move tail to the right */
                if ( size <= destlen+inslen ) { /* only a piece fits */
                    memmove( &dest[n+inslen], &dest[n], (size-(n+inslen)) );
                }
                else { /* insert the whole thing */
                    memmove( &dest[n+inslen], &dest[n], (destlen-n) );
                    size = destlen+inslen;
                }
                memmove( &dest[n] , insert , inslen ); /* insert */
            }
            dest[size] = 0;
        }
        return dest;
    }
The implementation of all the functions in ztring.c will force the string
length to fit within its size. This means that very long strings will
have their length adjusted according to the 'size' parameter received by
the routine. The following code illustrates this:

    char dest[15];
    {
        ztrcpy( sizeof(dest),dest, "012345" );
        assertTrue( eqstr( dest, "012345" ) );
        ztrins( /*size->*/ 1,dest, 0, "abc" );
        assertTrue( eqstr( dest, "" ) && eqstr( 1+dest, "12345" ) );
    }
    {
        ztrcpy( sizeof(dest),dest, "012345" );
        assertTrue( eqstr( dest, "012345" ) );
        ztrins( /*size->*/ 3,dest, 0,"abc" );
        assertTrue( eqstr( dest, "ab" ) && eqstr( 3+dest,"345" ) );
    }
In the first block of code, the 'size' parameter used to invoke ztrins()
is '1' (one), which leaves no space to store any letters within the
string. At run time, there is no way to figure out that the size of
'dest' is bigger than one, but nonetheless ztrins() zero terminates
'dest' to force it to be a string that can fit in a character array of
'size' characters.
In the second block the value stored in 'dest' is a string that has more
than '3' characters. When function ztrins() is invoked with a 'size'
parameter with value '3', it stores the end of string mark at dest[2],
which makes the value stored within 'dest' less than three characters
long. In this case, after invoking ztrins() the value stored in 'dest' is
bounded by the 'size' parameter used in the invocation. This behavior can
help to fix errors in some implementations.
Both the Apache Portable Runtime (APR) [4] and the GNOME Library [5]
provide functions similar to ztrcpy() (but there are none like ztrins()).
The differences are very subtle and deserve discussion. Both functions
apr_cpystrn() and g_strlcpy() receive the 'size' parameter as the last
one, whereas ztrcpy() has it as its first parameter. Function g_strlcpy()
is a portability wrapper used to call strlcpy() and it will always zero
terminate the destination string. Function apr_cpystrn() returns a
pointer to the end of string 'NUL' character as a means to check whether
the copied string was truncated because it did not fit in the
destination. To accomplish the same result when using ztrcpy(), code
similar to the following should be used:

    { /* detect truncation using 2 invocations to strlen() */
        ztrcpy( sizeof(dest),dest, src );
        if ( strlen(dest) < strlen(src) ) {
            take_action( "truncation occurred" );
        }
        /* apr_cpystrn() is faster because it requires only one invocation */
        /* to strlen(), which always examines all characters in the string */
        if ( (size_t)(apr_cpystrn(dest,src,sizeof(dest)) - dest) < strlen(src) ) {
            take_action( "truncation occurred" );
        }
        /* g_strlcpy() returns the length of the source string */
        if ( g_strlcpy(dest,src,sizeof(dest)) >= sizeof(dest) ) {
            take_action( "truncation occurred" );
        }
    }
Function g_strlcpy() does not return a pointer, but a number that can be
used to detect truncation. It is hard to debate what is best, but I
decided to make ztrcpy() as similar as possible to strcpy() to help
programmers substitute the latter with the former. The approach taken
with g_strlcpy() seems less convoluted than that of apr_cpystrn():
further discussion can be found in [6].
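When truncation must be detected with a function shaped like ztrcpy(), the second strlen() call can be avoided. After the copy, 'dest' holds a prefix of 'src', so the source was truncated exactly when the source character at index strlen(dest) is not the end of string mark. The sketch below wraps that test in a helper; both function names are hypothetical, the copy routine is a stand-in for the library's ztrcpy(), and 'src' is assumed to be non-NULL and not to overlap 'dest'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ztrcpy(); 'src' must not be NULL */
static char *ztrcpy_sketch( size_t size, char *dest, const char *src ) {
    size_t len;
    if ( dest==NULL || size==0 ) { return dest; }
    len = strlen( src );
    if ( len > size-1 ) { len = size-1; }
    memmove( dest, src, len );
    dest[len] = 0;
    return dest;
}

/* Copies like ztrcpy() and returns nonzero when 'src' did not fit.
   Assumption: a NULL 'dest' or a zero 'size' counts as truncation. */
int ztrcpy_truncated( size_t size, char *dest, const char *src ) {
    if ( dest==NULL || size==0 ) { return 1; }
    ztrcpy_sketch( size, dest, src );
    /* strlen(dest) <= strlen(src), so this index is always valid */
    return src[ strlen(dest) ] != 0;
}

void ztrcpy_truncated_demo( void ) {
    char dest[4];
    assert(  ztrcpy_truncated( sizeof(dest),dest, "abcdef" ) );
    assert(  strcmp(dest,"abc")==0 );
    assert( !ztrcpy_truncated( sizeof(dest),dest, "abc" ) );
}
```

This keeps the strcpy()-like return value of the copy routine untouched and moves the truncation question into a separate helper, which only the callers that care need to invoke.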
There are other functions that might be useful at times. The three
trimming functions [strltrim(), strrtrim(), strtrim()] can help in
removing leading or trailing characters. For simplicity, they take a
single character as parameter because trimming is usually done over
blanks. Instead of shifting the string value to the left, what strltrim()
(trim left) does is return a pointer just past all the trimmed
characters; this pointer can be used to work with the rest of the string.
If the desired behavior is to move the rest of the string left, a simple
memmove() invocation can be used (the length must be taken from the
trimmed pointer, so that nothing beyond the end of string mark is read):

    char *p = strltrim( str, ' ' );
    memmove( str, p, 1+strlen(p) );
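As an illustration of the trimming behavior described here, the following sketch implements a left trim that only returns a pointer, a right trim that shortens the string, and a helper that combines both and shifts the result left. All three names are hypothetical; the real functions in ztring.c may differ in details such as NULL handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strltrim(): returns a pointer to the first
   character of 'str' that is not 'tr'; the string is not modified. */
char *strltrim_sketch( const char *str, char tr ) {
    while ( *str!=0 && *str==tr ) { ++str; }
    return (char *)str;   /* drops const, as strchr() does */
}

/* Hypothetical stand-in for strrtrim(): cuts trailing 'tr' characters
   by moving the end of string mark to the left. */
char *strrtrim_sketch( char *str, char tr ) {
    size_t len = strlen( str );
    while ( len>0 && str[len-1]==tr ) { --len; }
    str[len] = 0;
    return str;
}

/* Trims both sides and shifts the kept characters to the left */
char *trim_in_place( char *str, char tr ) {
    char *first;
    strrtrim_sketch( str, tr );
    first = strltrim_sketch( str, tr );
    memmove( str, first, 1+strlen(first) );  /* +1 moves the zero too */
    return str;
}

void trim_demo( void ) {
    char s[] = "  hello  ";
    assert( strcmp( strltrim_sketch(s,' '), "hello  " )==0 );
    assert( strcmp( trim_in_place(s,' '),   "hello"   )==0 );
}
```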
There are also two functions to determine if a string is the prefix or
suffix of another. A little more interesting is function memczap(), which
scans a block of memory and removes a character, moving the other
characters left. For example, if a string contains "(*:**-*)", after
removing the asterisk '*' the value stored in the string will be "(:-)".
This is cute.
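A minimal sketch of these three helpers follows. The names are hypothetical, and one detail is an assumption rather than something the article states: the sketch of memczap() returns the number of bytes kept, which lets the caller place the end of string mark afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for memczap(): removes every byte equal to 'ch'
   from the first 'size' bytes of 'mem', moving the others left.
   Assumption: the return value is the number of bytes kept. */
size_t memczap_sketch( size_t size, void *mem, int ch ) {
    unsigned char *p = (unsigned char *)mem;
    size_t from, to = 0;
    if ( mem==NULL ) { return 0; }
    for ( from=0; from<size; ++from ) {
        if ( p[from] != (unsigned char)ch ) { p[to++] = p[from]; }
    }
    return to;
}

/* Nonzero when 'str' begins with 'prefix' */
int strpfx_sketch( const char *str, const char *prefix ) {
    return strncmp( str, prefix, strlen(prefix) ) == 0;
}

/* Nonzero when 'str' ends with 'suffix' */
int strsffx_sketch( const char *str, const char *suffix ) {
    size_t len = strlen( str ), suflen = strlen( suffix );
    return suflen<=len && strcmp( str+(len-suflen), suffix )==0;
}

void memczap_sketch_demo( void ) {
    char face[] = "(*:**-*)";
    size_t kept = memczap_sketch( strlen(face), face, '*' );
    face[kept] = 0;   /* the block was compacted; terminate the string */
    assert( kept==4 && strcmp(face,"(:-)")==0 );
    assert( strpfx_sketch("ztring.c","ztr") && strsffx_sketch("ztring.c",".c") );
}
```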
There are programmers who do not have problems using pointer arithmetic
to manipulate strings (I am not one of them). I wrote test_ztring.c, a
simple unit test program for these functions, but after getting all of
them to do what I expected, I still was not sure if my implementation was
free from memory overrun errors. I looked around the net for tools to
help make sure that my code did not have any unbounded string copies, but
in the end I decided to write my own bound checker as the C++ template
class zchz<>. I tweaked my code to use only simple string declarations
that can be transformed easily to use my template class. After that, I
used Tormod Tjaberg's program GSAR to transform each declaration [7],
following a pattern like this:

    C   →  char dest[15]   char s15[15]   sizeof(dest)   sizeof(s15)
    C++ →  zchz<15> dest   zchz<15> s15   dest.strsz()   s15.strsz()
I named my class zchz<> to preserve the same spacing from program
test_ztring.c into test_ztring.cpp (this is the C++ bound checking
version of the program). As it is invalid to overload sizeof() in C++
[8], I included method zchz<>::strsz() to get the value that sizeof()
would have returned if it were overloaded (again, I named this method
"strsz()", for "STRing SiZe", to preserve spacing within the test program
source code).
Any zchz<> variable contains three memory blocks that can hold no more
than '200' characters (this value is hard coded, but it can be changed if
bigger strings are required for testing). The middle one is used to store
a string value, and the other two are used to hold a bit pattern.
Whenever an unbounded memory copy occurs, either the left or the right
block gets corrupted: this event is discovered the next time that any
method of zchz<> is used. When running the test program step-by-step with
the symbolic debugger it is easy to pinpoint where each failure occurs.
It would be more useful if the exact location of the failure were
reported by zchz<>, but usually a test program displays output only on
failure and, after all test failures get fixed, the test program no
longer displays failure messages produced by methods from class zchz<>.
If no failure messages get displayed it means that all test cases were
successful.
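The guard block idea behind zchz<> can be illustrated in plain C. The sketch below is not the author's C++ template class: it is a simplified analogue in which every name and the guard byte value are assumptions. A string buffer sits between two blocks filled with a known pattern, and a check function reports whether either block was disturbed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUARD_SIZE 200    /* same hard coded limit the text mentions */
#define GUARD_BYTE 0xA5   /* assumption: any recognizable bit pattern */

/* The string lives between two blocks that hold the pattern */
typedef struct guarded_str {
    unsigned char left [GUARD_SIZE];
    char          str  [GUARD_SIZE];
    unsigned char right[GUARD_SIZE];
} guarded_str;

void guarded_init( guarded_str *g ) {
    memset( g->left,  GUARD_BYTE, sizeof(g->left)  );
    memset( g->right, GUARD_BYTE, sizeof(g->right) );
    g->str[0] = 0;
}

/* Returns 1 when both guard blocks still hold the pattern */
int guarded_ok( const guarded_str *g ) {
    size_t i;
    for ( i=0; i<GUARD_SIZE; ++i ) {
        if ( g->left[i]!=GUARD_BYTE || g->right[i]!=GUARD_BYTE ) { return 0; }
    }
    return 1;
}

void guarded_demo( void ) {
    guarded_str g;
    guarded_init( &g );
    memcpy( g.str, "fits", 5 );   /* a bounded copy leaves the guards alone */
    assert( guarded_ok(&g) );
    g.right[0] = 'X';             /* simulate the damage an overrun causes */
    assert( !guarded_ok(&g) );
}
```

A check like guarded_ok() only reports that corruption happened at some point before the call, which matches the limitation described above: the exact location of the offending copy still has to be found with a debugger.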
Unit tests are designed to exercise every feature in a program, but they can never be exhaustive. Hence, when a test program finds no failure it does not mean that the program is correct, because the program can still have bugs that were not uncovered by the test data. As we cannot work forever producing test cases, most of us stop testing when a reasonable number of test cases shows no failure.
It is not very difficult to improve on some of the functions that cause
many problems for C programmers. Moreover, the functions presented are
very similar to their less safe versions, which might convince some hard
core programmers to take a look at them (it could also happen that they
get enough recognition to become part of the standard language). If
programmers decide to use functions that are already included in other
libraries, maybe they can still take a look at function ztrins(), which
is seldom implemented elsewhere. The source code is available here:
http://www.di-mare.com/adolfo/p/ztring/ztring.zip
Alejandro Di Mare and David Chaves made valuable suggestions that helped
improve earlier versions of this work. The spelling and grammar were
checked using the http://spellcheckplus.com/ tool.
The Graduate Program in Computación e Informática, the
Escuela de Ciencias de la Computación e Informática and
the Universidad de Costa Rica provided funding for this research.
ztring.zip source code:
    http://www.di-mare.com/adolfo/p/ztring/ztring.zip
ztring.c: A few important extensions for <string.h>
    http://www.di-mare.com/adolfo/p/ztring/ztring_8c.html
uUnit.h: assertTrue() && assertFalse()
    http://www.di-mare.com/adolfo/p/ztring/uUnit_8h.html
ftp://ftp.stack.nl/pub/users/dimitri/doxygen-1.8.6-setup.exe
[1] Seacord, Robert: Managed String Library for C, Dr. Dobb's: The World
    of Software Development, October 01, 2005.
    http://www.drdobbs.com/cpp/184402023
    http://www.drdobbs.com/article/print?articleId=184402023
[2] Seacord, Robert: Secure Coding in C and C++: Strings, published by
    Addison-Wesley Professional, SEI Series in Software Engineering,
    September 9, 2005. Chapter available in:
    http://www.informit.com/articles/article.aspx?p=430402&seqNum=2
[3] Stroustrup, Bjarne: So, what's wrong with using macros?, in Bjarne
    Stroustrup's C++ Style and Technique FAQ, 2012.
    http://www.stroustrup.com/bs_faq2.html#macro
[4] The Apache Software Foundation: Apache Portable Runtime, 2014.
    http://apr.apache.org/
[5] GNOME Developer: GNOME Library, 2014.
    http://developer.gnome.org/
    http://ftp.gnome.org/pub/gnome/sources/glib/
[6] Miller, Todd C. & de Raadt, Theo: strlcpy and strlcat - consistent,
    safe, string copy and concatenation, in 1999 USENIX Annual Technical
    Conference. Monterey, California, USA, June 6–11, 1999.
    http://static.usenix.org/event/usenix99/full_papers/millert/millert.pdf
    http://www.courtesan.com/todd/papers/strlcpy.html
[7] Tjaberg, Tormod: gsar121.zip, 2008.
    http://home.online.no/~tjaberg/
    http://home.online.no/~tjaberg/gsar121.zip
    http://gnuwin32.sourceforge.net/packages/gsar.htm
[8] Stroustrup, Bjarne: Why can't I overload dot, ::, sizeof, etc.?, in
    Bjarne Stroustrup's C++ Style and Technique FAQ, 2012.
    http://www.stroustrup.com/bs_faq2.html#overload-dot
[-] Abstract
[1] Motivation
[2] Functionality
[3] Implementation
[4] Mandatory string size compliance
[5] Specification details
[6] Other useful functions
[7] Testing
[8] Conclusions
[9] Acknowledgments
[10] Source code
Adolfo Di Mare: Costa Rican researcher at the Escuela de Ciencias de la Computación e Informática [ECCI], Universidad de Costa Rica [UCR], where he is a full professor and works on Internet and programming technologies. He is also a full professor at the Universidad Autónoma de Centro América [UACA]. He obtained the Licenciatura at UCR, the Master of Science in Computer Science from the University of California, Los Angeles [UCLA], and the Ph.D. at the Universidad Autónoma de Centro América.
Reference: Di Mare, Adolfo: ztrcpy() and ztrins(): A few important extensions for <string.h>, Technical Report 2014-01-ADH, Escuela de Ciencias de la Computación e Informática, Universidad de Costa Rica, 2014.
Internet:
    http://www.di-mare.com/adolfo/p/ztring.htm
    http://www.di-mare.com/adolfo/p/ztring.pdf
    http://www.di-mare.com/adolfo/p/ztring/ztring.zip
See Also: http://www.drdobbs.com/cpp/232700238
Author: Adolfo Di Mare <adolfo@di-mare.com>
Contact: Apdo 4249-1000, San José, Costa Rica. Tel: (506) 2511-8000. Fax: (506) 2438-0139
Revision: ECCI-UCR, March 2014