Strings Tutorial


Strings are the data type that enables you to work with textual data.

So far, you have learned how to:

  1. Write literals such as "hello, world"
  2. Convert other types into strings: str(110) evaluates to "110"
  3. Build strings through concatenation: "COMP" + str(110) evaluates to "COMP110"

But what, really, are strings?

In this lesson, we will deepen our understanding of strings. We will zoom in and look at them up close, narrowing in on individual characters, and ask: what really is a character?

Accessing characters of a str

Textual data is made up of a sequence of characters. For example, the str "hello" comprises the characters 'h', 'e', 'l', 'l', 'o'. For the computer to represent and work with strings, it must also be able to work with characters.

It is worth noticing the distinction between characters and letters. Textual data must comprise more than just letters! There are numbers, spaces, tabs, new lines, characters from many alphabets such as 爱, and even emoji 🤠 that are also characters in text. A string is a sequence of characters.

The Python programming language, along with many other languages, has a special notation for accessing an individual item in a sequence, called subscript notation. This notation gives you the ability to ask a sequence for its “item at position int index”. The int index begins counting from 0, such that the item at index 0 is the first item, the item at index 1 is the second item, and so on.

Subscript Notation

Subscript notation uses square brackets [ ] to surround an int index expression. This notation follows a str expression. For example, "abcd"[0] evaluates to the character 'a' and "abcd"[1] evaluates to the character 'b'. Typically, you will subscript off of a variable storing a str and not a string literal. As an example, try the following in a Python REPL:

>>> a_str = "abcd"
>>> print(a_str[0])
a
>>> print(a_str[3])
d
>>> print(a_str[4])
IndexError: string index out of range

Notice in the example above that asking a string for an index beyond its maximum index leads to an IndexError.

Use len to find a str’s length

It’s helpful to know how many characters are in a str, to avoid writing code that leads to an IndexError. Python has a built-in len function that will tell you the “length” of any str.

>>> print(len("abcd"))
4
>>> message = "a b c"
>>> print(len(message))
5
>>> print(message[len(message) - 1])
c

Notice in the second example of len above that the length of the str "a b c" is 5. Remember, our emphasis is on characters, and spaces are characters.

Also notice, since sequences begin indexing at 0, the last character’s index is always one less than the length of the sequence.
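The relationship between len and the last valid index can be sketched as a small function. This is only an illustration: the name last_character is made up for this example, and returning an empty string for an empty str is one possible choice, not the only reasonable one.

```python
def last_character(text: str) -> str:
    """Return the final character of text, or "" when text is empty."""
    if len(text) == 0:
        # An empty str has no valid index, so there is nothing to subscript.
        return ""
    # Indexing starts at 0, so the last valid index is len(text) - 1.
    return text[len(text) - 1]


print(last_character("a b c"))  # c
print(last_character("abcd"))   # d
```

Checking the length first is what keeps the subscript from raising an IndexError on an empty str.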

As a bit of foreshadowing, both the subscript notation and len will be used to work with many other sequences and collections of data beyond just strings.

Using int Expressions in Subscript Notation

It was mentioned in passing above, but it must be emphasized now: the index is an int expression. Beyond simple int literal values, you can form any expression you’d like as long as it evaluates to an int. This is a SUPER POWER!

This means you can subscript with int variables, which is very common, but also with int arithmetic and calls to functions that return an int. Let’s focus on int variables with an example:

>>> s = "asdf"
>>> i = 0
>>> print(s[i])
a
>>> i += 1
>>> print(s[i])
s
>>> i += 1
>>> print(s[i])
d
>>> i += 1
>>> print(s[i])
f
>>> i += 1
>>> print(s[i])
IndexError: string index out of range

Notice, since i is an int variable we can access it from within the subscript notation of s[i]. Since we can reassign new values to i while the program is running, that means we can control which index the expression s[i] refers to at runtime.

Also notice the same two statements are repeated and the effect was to move index-by-index through the characters of the string. When you find yourself writing repeating statements it is a strong indicator you can use a loop to achieve your goal. As you know, though, loops have a test condition that ends the looping behavior. To loop over all items in a str, you can use its len as part of the loop condition.

>>> s = "hiya!"
>>> i = 0
>>> while i < len(s):
...     print(s[i])
...     i += 1
...
h
i
y
a
!

With this capability, you can write algorithms that are guided by the individual characters a string is made of. For example, when a website checks your password to be sure it meets some criteria, such as a minimum length and the presence of certain special characters, numbers, and letters… you could now write that function!
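Here is one way such a password check might look, as a rough sketch. The rules (at least 8 characters, at least one digit, at least one exclamation point) and the name is_valid_password are invented for illustration; a real website would choose its own criteria. The sketch also uses Python's in operator to ask whether a character appears in another string.

```python
def is_valid_password(password: str) -> bool:
    """Check hypothetical rules: 8+ characters, one digit, and one '!'."""
    if len(password) < 8:
        return False
    has_digit: bool = False
    has_bang: bool = False
    i: int = 0
    # Visit each character by index, just like the while loop shown earlier.
    while i < len(password):
        character: str = password[i]
        if character in "0123456789":
            has_digit = True
        elif character == "!":
            has_bang = True
        i += 1
    return has_digit and has_bang


print(is_valid_password("hunter2!!"))  # True
print(is_valid_password("hunter2"))    # False
```

The loop condition i < len(password) guarantees that password[i] is always a valid subscript.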

These same concepts of working with items in a sequence using subscript notation are applicable far beyond just string data. In the near future we will learn about lists which give you the ability to form sequences of items of any type not just characters. The same syntax, concepts, and general patterns for thinking apply there, too.

Characters

If a str is a sequence of characters, what are characters?

Like an onion, every abstraction has many layers to it. If you cut too many layers into the onion you leave the realm of computer science and enter the fields of physics and philosophy. You might also cry involuntarily. So let’s peel back only one more layer.

What a character is depends on how it is represented. On your screen, most characters present visually as a shape or space (but many do not!). A single character of data can be represented in many different visual forms by changing fonts, and you can accept that it’s still the same, single character of data just viewed in a different way. Icon fonts such as Wingdings intentionally present characters, even letters, as icons very different from their traditional form. Changing the font back and forth between an icon font and a traditional one doesn’t change the underlying character data, only how we see it and interpret it.

Inside the computer, characters are represented as coded patterns of 0s and 1s. There is a generally agreed upon “encoding” of character data in computing systems called the American Standard Code for Information Interchange, abbreviated ASCII, that was designed and decided upon in the 1960s. For example, the character A has the binary code 01000001. Numerical data, such as int values, can also be represented in a binary system. It so happens that the same binary pattern 01000001 can be interpreted as the int value 65. Since binary is outside our concern, the main takeaway here is that every character has a corresponding int value.

Use the ord function to find a character’s int code

Python’s built-in function ord, short for “ordinal” (a bit of a historical homage), takes a single-character string as an input parameter and returns the int representation of the character’s binary code.

>>> ord("A")
65
>>> ord("B")
66
>>> ord("Z")
90
>>> ord("a")
97
>>> ord("b")
98
>>> ord("z")
122

There are a few important observations to make in the example above.

First, notice that the codes for letters relative to one another are logical. In the English alphabet, A is followed directly by B, just as in integers 1 is followed by 2. Similarly, A’s integer ASCII code is 65 and B’s is 66. The specific numbers of either are not important, but their relationship to one another is.

>>> "A" < "B"
True
>>> "z" > "a"
True
>>> "a" < "Z"
False

The first two examples are reasonable, but the last might come as a surprise! How is “a” not less than “Z”? Compare their ord representations and you can learn why: every uppercase letter’s code is smaller than every lowercase letter’s code. For case-insensitive comparisons between any two characters, or strings, some additional work is needed.
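The sketch below shows why "a" < "Z" is False and one possible way to do the additional work for a case-insensitive comparison. The function name comes_before is made up for this example, and converting both strings with the str method lower is just one approach, assumed here because it is simple.

```python
# The codes explain the result: uppercase codes come before lowercase codes.
print(ord("a"))   # 97
print(ord("Z"))   # 90
print("a" < "Z")  # False, because 97 is not less than 90


def comes_before(first: str, second: str) -> bool:
    """Compare two strings alphabetically while ignoring case."""
    # lower() returns a copy with uppercase letters converted to lowercase,
    # so both sides are compared using the same range of codes.
    return first.lower() < second.lower()


print(comes_before("a", "Z"))  # True
```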

Use the chr function to convert an int to a character

Since characters and their integer codes are two sides of the same coin, you can freely go back and forth:

>>> chr(65)
'A'
>>> chr(122)
'z'
>>> chr(ord('A'))
'A'
>>> ord(chr(65))
65

The chr function is built into Python, takes an int parameter, and returns the corresponding single character as a string.
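Because ord and chr undo each other, you can do arithmetic on a character's code and convert the result back into a character. The following is a small sketch of that idea; the function name next_letter is hypothetical and it makes no attempt to wrap around at the end of the alphabet.

```python
def next_letter(letter: str) -> str:
    """Return the character whose code is one greater than letter's code."""
    return chr(ord(letter) + 1)


print(next_letter("A"))  # B
print(next_letter("y"))  # z
print(next_letter("Z"))  # [
```

Notice the last call: the code after Z's 90 is 91, which belongs to the character [ rather than to a letter.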

What about foreign languages and emoji?

When ASCII was decided in the 1960s, it was a big achievement to include both lower and uppercase letters in the standard. Emoji, and much more importantly large-alphabet languages such as Chinese, were not possible until later. As additional characters were added to international standards, the set of total characters possible expanded well beyond ASCII’s initial 128-character specification.

For example, try the following in the REPL:

>>> chr(129312)

Hold on to your saddles, because we’re about to go on a little adventure. This is a bit outside of the scope of your concerns right now, but in order to make use of Emoji in our programs (which is of utmost importance) there’s just a little more to the story to reveal.

Putting a hex on large integers

Just as the integer 90 can be represented in a binary system with 01011010, it can also be represented in a hexadecimal system with 5A. It is beyond your concern to navigate numbering systems, but the one you know and love is base 10, meaning we have 10 digits ranging from 0-9. (Notice the 0 indexing you grew comfortable with in elementary school!) Binary is base 2 and has only 2 digits: 0 and 1. Hexadecimal is base 16 and has 16 digits, 0-9 followed by A-F which correspond to the decimal values of 10-15. Computer scientists love hexadecimal because each digit corresponds to four binary digits. Notice that in the example: 01011010, which is 8 binary digits, is equivalent to 5A.

Python has a built-in hex function for converting an int to its hexadecimal representation, which it returns as a str. The 0x in front simply marks the digits that follow as hexadecimal, and hexadecimal digits are case insensitive.

>>> hex(90)
'0x5a'
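If you would like to check the binary and hexadecimal claims above for yourself, the following sketch converts the same value between bases using only built-in functions. The variable name value is just for illustration.

```python
value: int = 90

print(hex(value))            # 0x5a
print(bin(value))            # 0b1011010
print(format(value, "08b"))  # 01011010, padded to 8 binary digits

# int can also read digits written in another base.
print(int("5A", 16))         # 90
print(int("01011010", 2))    # 90

# A hexadecimal int literal is the same value, in upper or lower case.
print(0x5A == 90)            # True
print(0x5a == 0x5A)          # True
```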

When looking up the codes for emoji or characters in other languages, they will tend to be presented to you in a hex format, such as on this site. You will notice in the code column, there is a format of U+1F920. The U tells you this is Unicode, an international coding standard, and the 1F920 is a hexadecimal representation of the code for the cowboy emoji. In Python, you can use such a Unicode character in your strings as follows:

>>> print("The \U0001F920 rides a \U0001F40E!")
The 🤠 rides a 🐎!

The leading backslash begins an escape sequence, which will be discussed in depth shortly. The U is an indication that what will follow is an 8-digit hex representation of a Unicode character. Then, to encode 1F920, we must add three leading 0s for padding because 8 digits are expected.

It is worth taking a moment to appreciate that Python is doing a proper job of treating those emoji each as an individual item in our sequence of characters.

>>> emoji: str = "\U0001F920\U0001F40E"
>>> print(emoji)
🤠🐎
>>> len(emoji)
2
>>> emoji[0]
'🤠'
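The pieces from the last few sections fit together: the hexadecimal digits in a code such as U+1F920 are just another way of writing the int that chr expects. Here is a brief sketch of that connection. The function name character_from_hex is made up for this example, and it prints codes and comparisons rather than the emoji itself so the output looks the same in any terminal.

```python
def character_from_hex(hex_digits: str) -> str:
    """Convert hex digits such as "1F920" into the character with that code."""
    return chr(int(hex_digits, 16))


cowboy: str = character_from_hex("1F920")

print(cowboy == "\U0001F920")  # True
print(ord(cowboy))             # 129312
print(hex(ord(cowboy)))        # 0x1f920
print(len(cowboy))             # 1
```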

String Escape Sequences

Notice in the previous example the backslashes in the string "\U0001F920\U0001F40E" are signalling something special is about to occur. In this case, what follows is a U which indicates “8 hexadecimal digits encoding a single Unicode character” follow.

There are other escape sequences, as well. Here are some common ones:

Escape Sequence   Meaning
\"                Double quote (")
\'                Single quote (')
\t                Tab
\n                New Line
\Uxxxxxxxx        32-bit Unicode character
\\                Backslash (\)

Notice these escape sequences all begin with a backslash. The most interesting one is the first. If you write a string surrounded in double quote characters ", how can you use a double quote character inside the string? Escaping to the rescue! The sequence \", when evaluated, gets interpreted as a single double quote character in the resulting string value.

The last entry in the table is also interesting. If backslashes are how we begin writing an escape sequence in a string literal in Python… how do we write a single backslash? Well, by writing two backslashes back-to-back, of course. The first backslash begins an escape sequence, the second backslash causes the sequence to evaluate to a single backslash.
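A short sketch can confirm that each escape sequence, although written with two or more characters in your code, evaluates to a single character in the resulting string. The variable names and the example path are made up for illustration.

```python
quote: str = "She said \"hi\""
print(quote)       # She said "hi"

# Each escape sequence counts as one character in the string's value.
print(len("\""))   # 1
print(len("\\"))   # 1
print(len("\n"))   # 1
print(len("\t"))   # 1

# Two backslashes in the literal produce one backslash in the value.
path: str = "C:\\temp"
print(path)        # C:\temp
print(len(path))   # 7

# A new line escape splits the printed output across two lines.
print("line one\nline two")
```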

f-Strings “Format” Strings

Zooming back out to thinking of strings at a high-level, by now you have used concatenation enough to recognize that concatenating strings can be a lot of work! Especially if you are concatenating non-string values in the middle of a larger string. Modern Python has a special kind of string literal called a format string or f-string for short, that makes this much easier. Consider the following examples:

>>> course: int = 110
>>> print("I am in COMP" + str(course) + " right now!")
I am in COMP110 right now!
>>> print(f"I am in COMP{ course } right now!")
I am in COMP110 right now!

A key distinction between a regular string and an f-string is that an f-string begins with the letter f preceding its quotes. Notice the difference between f"Hi" and "Hi", where the former is an f-string.

Inside of an f-string you can write an expression inside of curly braces and it will get substituted with the expression’s value when the string literal is evaluated. Spaces around the expression inside of the curly braces are ignored. This is especially handy if you are building a string with multiple variables being concatenated together. Consider the difference of:

>>> name: str = "Lauren"
>>> age_turning: int = 21
>>> print("Hello " + name + ", you're almost " + str(age_turning) + "!")
Hello Lauren, you're almost 21!
>>> print(f"Hello {name}, you're almost {age_turning}!")
Hello Lauren, you're almost 21!
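The curly braces are not limited to variable names. As a small sketch, the example below places a function call and some int arithmetic inside an f-string; the variable message and the sentence it builds are made up for illustration.

```python
name: str = "Lauren"
age_turning: int = 21

# Any expression can appear in the braces; its value is converted to a str.
message: str = f"{name} has {len(name)} letters and turns {age_turning + 1} next year."
print(message)  # Lauren has 6 letters and turns 22 next year.

# Spaces around the expression inside the braces make no difference.
print(f"{ name }" == f"{name}")  # True
```

Notice there is no need to call str on the int values, as there would be with concatenation.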

There are other powerful things format strings can do, too, but they are outside the scope of this course. If you’d like to learn more, this guide covers many useful cases. Other modern programming languages are adopting variations of this same concept, which you may also hear referred to as string interpolation.