mirror of git://gcc.gnu.org/git/gcc.git
				
				
				
			
		
			
				
	
	
		
			483 lines
		
	
	
		
			20 KiB
		
	
	
	
		
			XML
		
	
	
	
			
		
		
	
	
			483 lines
		
	
	
		
			20 KiB
		
	
	
	
		
			XML
		
	
	
	
| <chapter xmlns="http://docbook.org/ns/docbook" version="5.0" 
 | |
| 	 xml:id="std.strings" xreflabel="Strings">
 | |
| <?dbhtml filename="strings.html"?>
 | |
| 
 | |
| <info><title>
 | |
|   Strings
 | |
|   <indexterm><primary>Strings</primary></indexterm>
 | |
| </title>
 | |
|   <keywordset>
 | |
|     <keyword>ISO C++</keyword>
 | |
|     <keyword>library</keyword>
 | |
|   </keywordset>
 | |
| </info>
 | |
| 
 | |
| <!-- Sect1 01 : Character Traits -->
 | |
| 
 | |
| <!-- Sect1 02 : String Classes -->
 | |
| <section xml:id="std.strings.string" xreflabel="string"><info><title>String Classes</title></info>
 | |
|   
 | |
| 
 | |
|   <section xml:id="strings.string.simple" xreflabel="Simple Transformations"><info><title>Simple Transformations</title></info>
 | |
|     
 | |
|     <para>
 | |
|       Here are Standard, simple, and portable ways to perform common
 | |
|       transformations on a <code>string</code> instance, such as
 | |
|       "convert to all upper case." The word transformations
 | |
|       is especially apt, because the standard template function
 | |
|       <code>transform<></code> is used.
 | |
|    </para>
 | |
|    <para>
 | |
|      This code will go through some iterations.  Here's a simple
 | |
|      version:
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    #include <string>
 | |
|    #include <algorithm>
 | |
|    #include <cctype>      // old <ctype.h>
 | |
| 
 | |
|    struct ToLower
 | |
|    {
 | |
|      char operator() (char c) const  { return std::tolower(c); }
 | |
|    };
 | |
| 
 | |
|    struct ToUpper
 | |
|    {
 | |
|      char operator() (char c) const  { return std::toupper(c); }
 | |
|    };
 | |
| 
 | |
|    int main()
 | |
|    {
 | |
|      std::string  s ("Some Kind Of Initial Input Goes Here");
 | |
| 
 | |
|      // Change everything into upper case
 | |
|      std::transform (s.begin(), s.end(), s.begin(), ToUpper());
 | |
| 
 | |
|      // Change everything into lower case
 | |
|      std::transform (s.begin(), s.end(), s.begin(), ToLower());
 | |
| 
 | |
|      // Change everything back into upper case, but store the
 | |
|      // result in a different string
 | |
|      std::string  capital_s;
 | |
|      capital_s.resize(s.size());
 | |
|      std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
 | |
|    }
 | |
|    </programlisting>
 | |
|    <para>
 | |
|      <emphasis>Note</emphasis> that these calls all
 | |
|       involve the global C locale through the use of the C functions
 | |
|       <code>toupper/tolower</code>.  This is absolutely guaranteed to work --
 | |
|       but <emphasis>only</emphasis> if the string contains <emphasis>only</emphasis> characters
 | |
|       from the basic source character set, and there are <emphasis>only</emphasis>
 | |
|       96 of those.  Which means that not even all English text can be
 | |
|       represented (certain British spellings, proper names, and so forth).
 | |
|       So, if all your input forevermore consists of only those 96
 | |
|       characters (hahahahahaha), then you're done.
 | |
|    </para>
 | |
|    <para><emphasis>Note</emphasis> that the
 | |
|       <code>ToUpper</code> and <code>ToLower</code> function objects
 | |
|       are needed because <code>toupper</code> and <code>tolower</code>
 | |
|       are overloaded names (declared in <code><cctype></code> and
 | |
|       <code><locale></code>) so the template-arguments for
 | |
|       <code>transform<></code> cannot be deduced, as explained in
 | |
|       <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-11/msg00180.html">this
 | |
|       message</link>.
 | |
|       <!-- section 14.8.2.4 clause 16 in ISO 14882:1998  -->
 | |
|       At minimum, you can write short wrappers like
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    char toLower (char c)
 | |
|    {
 | |
|       return std::tolower(c);
 | |
|    } </programlisting>
 | |
|    <para>(Thanks to James Kanze for assistance and suggestions on all of this.)
 | |
|    </para>
 | |
|    <para>Another common operation is trimming off excess whitespace.  Much
 | |
|       like transformations, this task is trivial with the use of string's
 | |
|       <code>find</code> family.  These examples are broken into multiple
 | |
|       statements for readability:
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    std::string  str (" \t blah blah blah    \n ");
 | |
| 
 | |
|    // trim leading whitespace
 | |
|    string::size_type  notwhite = str.find_first_not_of(" \t\n");
 | |
|    str.erase(0,notwhite);
 | |
| 
 | |
|    // trim trailing whitespace
 | |
|    notwhite = str.find_last_not_of(" \t\n");
 | |
|    str.erase(notwhite+1); </programlisting>
 | |
|    <para>Obviously, the calls to <code>find</code> could be inserted directly
 | |
|       into the calls to <code>erase</code>, in case your compiler does not
 | |
|       optimize named temporaries out of existence.
 | |
|    </para>
 | |
| 
 | |
|   </section>
 | |
|   <section xml:id="strings.string.case" xreflabel="Case Sensitivity"><info><title>Case Sensitivity</title></info>
 | |
|     
 | |
|     <para>
 | |
|     </para>
 | |
| 
 | |
|    <para>The well-known-and-if-it-isn't-well-known-it-ought-to-be
 | |
|       <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gotw.ca/gotw/">Guru of the Week</link>
 | |
|       discussions held on Usenet covered this topic in January of 1998.
 | |
|       Briefly, the challenge was, <quote>write a 'ci_string' class which
 | |
|       is identical to the standard 'string' class, but is
 | |
|       case-insensitive in the same way as the (common but nonstandard)
 | |
|       C function stricmp()</quote>.
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    ci_string s( "AbCdE" );
 | |
| 
 | |
|    // case insensitive
 | |
|    assert( s == "abcde" );
 | |
|    assert( s == "ABCDE" );
 | |
| 
 | |
|    // still case-preserving, of course
 | |
|    assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
 | |
|    assert( strcmp( s.c_str(), "abcde" ) != 0 ); </programlisting>
 | |
| 
 | |
|    <para>The solution is surprisingly easy.  The original answer was
 | |
|    posted on Usenet, and a revised version appears in Herb Sutter's
 | |
|    book <emphasis>Exceptional C++</emphasis> and on his website as <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gotw.ca/gotw/029.htm">GotW 29</link>.
 | |
|    </para>
 | |
|    <para>See?  Told you it was easy!</para>
 | |
|    <para>
 | |
|      <emphasis>Added June 2000:</emphasis> The May 2000 issue of C++
 | |
|      Report contains a fascinating <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://lafstern.org/matt/col2_new.pdf"> article</link> by
 | |
|      Matt Austern (yes, <emphasis>the</emphasis> Matt Austern) on why
 | |
|      case-insensitive comparisons are not as easy as they seem, and
 | |
|      why creating a class is the <emphasis>wrong</emphasis> way to go
 | |
|      about it in production code.  (The GotW answer mentions one of
 | |
|      the principle difficulties; his article mentions more.)
 | |
|    </para>
 | |
|    <para>Basically, this is "easy" only if you ignore some things,
 | |
|       things which may be too important to your program to ignore.  (I chose
 | |
|       to ignore them when originally writing this entry, and am surprised
 | |
|       that nobody ever called me on it...)  The GotW question and answer
 | |
|       remain useful instructional tools, however.
 | |
|    </para>
 | |
|    <para><emphasis>Added September 2000:</emphasis>  James Kanze provided a link to a
 | |
|       <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.unicode.org/reports/tr21/tr21-5.html">Unicode
 | |
|       Technical Report discussing case handling</link>, which provides some
 | |
|       very good information.
 | |
|    </para>
 | |
| 
 | |
|   </section>
 | |
|   <section xml:id="strings.string.character_types" xreflabel="Arbitrary Characters"><info><title>Arbitrary Character Types</title></info>
 | |
|     
 | |
|     <para>
 | |
|     </para>
 | |
| 
 | |
|    <para>The <code>std::basic_string</code> is tantalizingly general, in that
 | |
|       it is parameterized on the type of the characters which it holds.
 | |
|       In theory, you could whip up a Unicode character class and instantiate
 | |
|       <code>std::basic_string<my_unicode_char></code>, or assuming
 | |
|       that integers are wider than characters on your platform, maybe just
 | |
|       declare variables of type <code>std::basic_string<int></code>.
 | |
|    </para>
 | |
|    <para>That's the theory.  Remember however that basic_string has additional
 | |
|       type parameters, which take default arguments based on the character
 | |
|       type (called <code>CharT</code> here):
 | |
|    </para>
 | |
|    <programlisting>
 | |
|       template <typename CharT,
 | |
| 		typename Traits = char_traits<CharT>,
 | |
| 		typename Alloc = allocator<CharT> >
 | |
|       class basic_string { .... };</programlisting>
 | |
|    <para>Now, <code>allocator<CharT></code> will probably Do The Right
 | |
|       Thing by default, unless you need to implement your own allocator
 | |
|       for your characters.
 | |
|    </para>
 | |
|    <para>But <code>char_traits</code> takes more work.  The char_traits
 | |
|       template is <emphasis>declared</emphasis> but not <emphasis>defined</emphasis>.
 | |
|       That means there is only
 | |
|    </para>
 | |
|    <programlisting>
 | |
|       template <typename CharT>
 | |
| 	struct char_traits
 | |
| 	{
 | |
| 	    static void foo (type1 x, type2 y);
 | |
| 	    ...
 | |
| 	};</programlisting>
 | |
|    <para>and functions such as char_traits<CharT>::foo() are not
 | |
|       actually defined anywhere for the general case.  The C++ standard
 | |
|       permits this, because writing such a definition to fit all possible
 | |
|       CharT's cannot be done.
 | |
|    </para>
 | |
|    <para>The C++ standard also requires that char_traits be specialized for
 | |
|       instantiations of <code>char</code> and <code>wchar_t</code>, and it
 | |
|       is these template specializations that permit entities like
 | |
|       <code>basic_string<char,char_traits<char>></code> to work.
 | |
|    </para>
 | |
|    <para>If you want to use character types other than char and wchar_t,
 | |
|       such as <code>unsigned char</code> and <code>int</code>, you will
 | |
|       need suitable specializations for them.  For a time, in earlier
 | |
|       versions of GCC, there was a mostly-correct implementation that
 | |
|       let programmers be lazy but it broke under many situations, so it
 | |
|       was removed.  GCC 3.4 introduced a new implementation that mostly
 | |
|       works and can be specialized even for <code>int</code> and other
 | |
|       built-in types.
 | |
|    </para>
 | |
|    <para>If you want to use your own special character class, then you have
 | |
|       <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00163.html">a lot
 | |
|       of work to do</link>, especially if you with to use i18n features
 | |
|       (facets require traits information but don't have a traits argument).
 | |
|    </para>
 | |
|    <para>Another example of how to specialize char_traits was given <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00260.html">on the
 | |
|       mailing list</link> and at a later date was put into the file <code>
 | |
|       include/ext/pod_char_traits.h</code>.  We agree
 | |
|       that the way it's used with basic_string (scroll down to main())
 | |
|       doesn't look nice, but that's because <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00236.html">the
 | |
|       nice-looking first attempt</link> turned out to <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00242.html">not
 | |
|       be conforming C++</link>, due to the rule that CharT must be a POD.
 | |
|       (See how tricky this is?)
 | |
|    </para>
 | |
| 
 | |
|   </section>
 | |
| 
 | |
|   <section xml:id="strings.string.token" xreflabel="Tokenizing"><info><title>Tokenizing</title></info>
 | |
|     
 | |
|     <para>
 | |
|     </para>
 | |
|    <para>The Standard C (and C++) function <code>strtok()</code> leaves a lot to
 | |
|       be desired in terms of user-friendliness.  It's unintuitive, it
 | |
|       destroys the character string on which it operates, and it requires
 | |
|       you to handle all the memory problems.  But it does let the client
 | |
|       code decide what to use to break the string into pieces; it allows
 | |
|       you to choose the "whitespace," so to speak.
 | |
|    </para>
 | |
|    <para>A C++ implementation lets us keep the good things and fix those
 | |
|       annoyances.  The implementation here is more intuitive (you only
 | |
|       call it once, not in a loop with varying argument), it does not
 | |
|       affect the original string at all, and all the memory allocation
 | |
|       is handled for you.
 | |
|    </para>
 | |
|    <para>It's called stringtok, and it's a template function. Sources are
 | |
|    as below, in a less-portable form than it could be, to keep this
 | |
|    example simple (for example, see the comments on what kind of
 | |
|    string it will accept).
 | |
|    </para>
 | |
| 
 | |
| <programlisting>
 | |
| #include <string>
 | |
| template <typename Container>
 | |
| void
 | |
| stringtok(Container &container, string const &in,
 | |
| 	  const char * const delimiters = " \t\n")
 | |
| {
 | |
|     const string::size_type len = in.length();
 | |
| 	  string::size_type i = 0;
 | |
| 
 | |
|     while (i < len)
 | |
|     {
 | |
| 	// Eat leading whitespace
 | |
| 	i = in.find_first_not_of(delimiters, i);
 | |
| 	if (i == string::npos)
 | |
| 	  return;   // Nothing left but white space
 | |
| 
 | |
| 	// Find the end of the token
 | |
| 	string::size_type j = in.find_first_of(delimiters, i);
 | |
| 
 | |
| 	// Push token
 | |
| 	if (j == string::npos)
 | |
| 	{
 | |
| 	  container.push_back(in.substr(i));
 | |
| 	  return;
 | |
| 	}
 | |
| 	else
 | |
| 	  container.push_back(in.substr(i, j-i));
 | |
| 
 | |
| 	// Set up for next loop
 | |
| 	i = j + 1;
 | |
|     }
 | |
| }
 | |
| </programlisting>
 | |
| 
 | |
| 
 | |
|    <para>
 | |
|      The author uses a more general (but less readable) form of it for
 | |
|      parsing command strings and the like.  If you compiled and ran this
 | |
|      code using it:
 | |
|    </para>
 | |
| 
 | |
| 
 | |
|    <programlisting>
 | |
|    std::list<string>  ls;
 | |
|    stringtok (ls, " this  \t is\t\n  a test  ");
 | |
|    for (std::list<string>const_iterator i = ls.begin();
 | |
| 	i != ls.end(); ++i)
 | |
|    {
 | |
|        std::cerr << ':' << (*i) << ":\n";
 | |
|    } </programlisting>
 | |
|    <para>You would see this as output:
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    :this:
 | |
|    :is:
 | |
|    :a:
 | |
|    :test: </programlisting>
 | |
|    <para>with all the whitespace removed.  The original <code>s</code> is still
 | |
|       available for use, <code>ls</code> will clean up after itself, and
 | |
|       <code>ls.size()</code> will return how many tokens there were.
 | |
|    </para>
 | |
|    <para>As always, there is a price paid here, in that stringtok is not
 | |
|       as fast as strtok.  The other benefits usually outweigh that, however.
 | |
|    </para>
 | |
| 
 | |
|    <para><emphasis>Added February 2001:</emphasis>  Mark Wilden pointed out that the
 | |
|       standard <code>std::getline()</code> function can be used with standard
 | |
|       <code>istringstreams</code> to perform
 | |
|       tokenizing as well.  Build an istringstream from the input text,
 | |
|       and then use std::getline with varying delimiters (the three-argument
 | |
|       signature) to extract tokens into a string.
 | |
|    </para>
 | |
| 
 | |
| 
 | |
|   </section>
 | |
|   <section xml:id="strings.string.shrink" xreflabel="Shrink to Fit"><info><title>Shrink to Fit</title></info>
 | |
|     
 | |
|     <para>
 | |
|     </para>
 | |
|    <para>From GCC 3.4 calling <code>s.reserve(res)</code> on a
 | |
|       <code>string s</code> with <code>res < s.capacity()</code> will
 | |
|       reduce the string's capacity to <code>std::max(s.size(), res)</code>.
 | |
|    </para>
 | |
|    <para>This behaviour is suggested, but not required by the standard. Prior
 | |
|       to GCC 3.4 the following alternative can be used instead
 | |
|    </para>
 | |
|    <programlisting>
 | |
|       std::string(str.data(), str.size()).swap(str);
 | |
|    </programlisting>
 | |
|    <para>This is similar to the idiom for reducing
 | |
|       a <code>vector</code>'s memory usage
 | |
|       (see <link linkend="faq.size_equals_capacity">this FAQ
 | |
|       entry</link>) but the regular copy constructor cannot be used
 | |
|       because libstdc++'s <code>string</code> is Copy-On-Write.
 | |
|    </para>
 | |
|    <para>In <link linkend="status.iso.2011">C++11</link> mode you can call
 | |
|       <code>s.shrink_to_fit()</code> to achieve the same effect as
 | |
|       <code>s.reserve(s.size())</code>.
 | |
|    </para>
 | |
| 
 | |
| 
 | |
|   </section>
 | |
| 
 | |
|   <section xml:id="strings.string.Cstring" xreflabel="CString (MFC)"><info><title>CString (MFC)</title></info>
 | |
|     
 | |
|     <para>
 | |
|     </para>
 | |
| 
 | |
|    <para>A common lament seen in various newsgroups deals with the Standard
 | |
|       string class as opposed to the Microsoft Foundation Class called
 | |
|       CString.  Often programmers realize that a standard portable
 | |
|       answer is better than a proprietary nonportable one, but in porting
 | |
|       their application from a Win32 platform, they discover that they
 | |
|       are relying on special functions offered by the CString class.
 | |
|    </para>
 | |
|    <para>Things are not as bad as they seem.  In
 | |
|       <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://gcc.gnu.org/ml/gcc/1999-04n/msg00236.html">this
 | |
|       message</link>, Joe Buck points out a few very important things:
 | |
|    </para>
 | |
|       <itemizedlist>
 | |
| 	 <listitem><para>The Standard <code>string</code> supports all the operations
 | |
| 	     that CString does, with three exceptions.
 | |
| 	 </para></listitem>
 | |
| 	 <listitem><para>Two of those exceptions (whitespace trimming and case
 | |
| 	     conversion) are trivial to implement.  In fact, we do so
 | |
| 	     on this page.
 | |
| 	 </para></listitem>
 | |
| 	 <listitem><para>The third is <code>CString::Format</code>, which allows formatting
 | |
| 	     in the style of <code>sprintf</code>.  This deserves some mention:
 | |
| 	 </para></listitem>
 | |
|       </itemizedlist>
 | |
|    <para>
 | |
|       The old libg++ library had a function called form(), which did much
 | |
|       the same thing.  But for a Standard solution, you should use the
 | |
|       stringstream classes.  These are the bridge between the iostream
 | |
|       hierarchy and the string class, and they operate with regular
 | |
|       streams seamlessly because they inherit from the iostream
 | |
|       hierarchy.  An quick example:
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    #include <iostream>
 | |
|    #include <string>
 | |
|    #include <sstream>
 | |
| 
 | |
|    string f (string& incoming)     // incoming is "foo  N"
 | |
|    {
 | |
|        istringstream   incoming_stream(incoming);
 | |
|        string          the_word;
 | |
|        int             the_number;
 | |
| 
 | |
|        incoming_stream >> the_word        // extract "foo"
 | |
| 		       >> the_number;     // extract N
 | |
| 
 | |
|        ostringstream   output_stream;
 | |
|        output_stream << "The word was " << the_word
 | |
| 		     << " and 3*N was " << (3*the_number);
 | |
| 
 | |
|        return output_stream.str();
 | |
|    } </programlisting>
 | |
|    <para>A serious problem with CString is a design bug in its memory
 | |
|       allocation.  Specifically, quoting from that same message:
 | |
|    </para>
 | |
|    <programlisting>
 | |
|    CString suffers from a common programming error that results in
 | |
|    poor performance.  Consider the following code:
 | |
| 
 | |
|    CString n_copies_of (const CString& foo, unsigned n)
 | |
|    {
 | |
| 	   CString tmp;
 | |
| 	   for (unsigned i = 0; i < n; i++)
 | |
| 		   tmp += foo;
 | |
| 	   return tmp;
 | |
|    }
 | |
| 
 | |
|    This function is O(n^2), not O(n).  The reason is that each +=
 | |
|    causes a reallocation and copy of the existing string.  Microsoft
 | |
|    applications are full of this kind of thing (quadratic performance
 | |
|    on tasks that can be done in linear time) -- on the other hand,
 | |
|    we should be thankful, as it's created such a big market for high-end
 | |
|    ix86 hardware. :-)
 | |
| 
 | |
|    If you replace CString with string in the above function, the
 | |
|    performance is O(n).
 | |
|    </programlisting>
 | |
|    <para>Joe Buck also pointed out some other things to keep in mind when
 | |
|       comparing CString and the Standard string class:
 | |
|    </para>
 | |
|       <itemizedlist>
 | |
| 	 <listitem><para>CString permits access to its internal representation; coders
 | |
| 	     who exploited that may have problems moving to <code>string</code>.
 | |
| 	 </para></listitem>
 | |
| 	 <listitem><para>Microsoft ships the source to CString (in the files
 | |
| 	     MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation
 | |
| 	     bug and rebuild your MFC libraries.
 | |
| 	     <emphasis><emphasis>Note:</emphasis> It looks like the CString shipped
 | |
| 	     with VC++6.0 has fixed this, although it may in fact have been
 | |
| 	     one of the VC++ SPs that did it.</emphasis>
 | |
| 	 </para></listitem>
 | |
| 	 <listitem><para><code>string</code> operations like this have O(n) complexity
 | |
| 	     <emphasis>if the implementors do it correctly</emphasis>.  The libstdc++
 | |
| 	     implementors did it correctly.  Other vendors might not.
 | |
| 	 </para></listitem>
 | |
| 	 <listitem><para>While chapters of the SGI STL are used in libstdc++, their
 | |
| 	     string class is not.  The SGI <code>string</code> is essentially
 | |
| 	     <code>vector<char></code> and does not do any reference
 | |
| 	     counting like libstdc++'s does.  (It is O(n), though.)
 | |
| 	     So if you're thinking about SGI's string or rope classes,
 | |
| 	     you're now looking at four possibilities:  CString, the
 | |
| 	     libstdc++ string, the SGI string, and the SGI rope, and this
 | |
| 	     is all before any allocator or traits customizations!  (More
 | |
| 	     choices than you can shake a stick at -- want fries with that?)
 | |
| 	 </para></listitem>
 | |
|       </itemizedlist>
 | |
| 
 | |
|   </section>
 | |
| </section>
 | |
| 
 | |
| <!-- Sect1 03 : Interacting with C -->
 | |
| 
 | |
| </chapter>
 |