Strings are the data type that enables you to work with textual data.
So far, you have learned how to:
Write a string literal: "hello, world"
Convert another value to a string: str(110)
evaluates to "110"
Concatenate strings: "COMP" + str(110)
evaluates to "COMP110"
But what, really, are strings?
In this lesson, we will deepen our understanding of strings. We will zoom in on strings, look at them up close, narrow in on individual characters, and ask: what, really, is a character?
str
Textual data is made up of a sequence of characters. For example, the str
"hello"
comprises the characters 'h'
, 'e'
, 'l'
, 'l'
, 'o'
. For the computer to represent and work with strings, it must certainly also be able to work with characters.
It is worth noticing the distinction between characters and letters. Textual data must comprise more than just letters! There are digits, spaces, tabs, new lines, characters from many alphabets such as 爱, and even emoji such as 🤠 that are also characters in text. Strings are sequences of characters.
The Python programming language, along with many other languages, has a special notation for accessing an individual item in a sequence, called subscript notation. This notation gives you the ability to ask a sequence for its “item at position int index”. The int index begins counting from 0
, such that the item at index 0
is the first item, the item at index 1
is the second item, and so on.
Subscript notation uses square brackets [
]
to surround an int
index expression. This notation follows a str
expression. For example, "abcd"[0]
evaluates to the character 'a'
and "abcd"[1]
evaluates to the character 'b'
. Typically, you will subscript off of a variable storing a str
and not a string literal. As an example, try the following in a Python REPL:
>>> a_str = "abcd"
>>> print(a_str[0])
a
>>> print(a_str[3])
d
>>> print(a_str[4])
IndexError: string index out of range
Notice in the example above that asking for an index beyond the string’s maximum index leads to an IndexError
.
len
to find a str
’s length

It’s helpful to know how many characters are in a str
, to avoid writing code that leads to an IndexError
. Python has a built-in len
function that will tell you the “length” of any str
.
>>> print(len("abcd"))
4
>>> message = "a b c"
>>> print(len(message))
5
>>> print(message[len(message) - 1])
c
Notice in the second example of len
above the length of str
"a b c"
is 5
. Remember, our emphasis is on characters and spaces are characters.
Also notice, since sequences begin indexing at 0
, the last character’s index is always one less than the length of the sequence.
As a bit of foreshadowing, both the subscript notation and len
will be used to work with many other sequences and collections of data beyond just strings.
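As a small sketch of how len can guard against an IndexError, the two helpers below check an index against the length before subscripting. The names last_char and char_at are hypothetical, chosen only for illustration, and this sketch treats negative indices as out of range for simplicity.

```python
def last_char(text: str) -> str:
    """Return the last character of text, or "" if text is empty."""
    if len(text) == 0:
        return ""
    # The last valid index is always one less than the length.
    return text[len(text) - 1]


def char_at(text: str, index: int) -> str:
    """Return the character at index, or "" when index is out of range."""
    if 0 <= index < len(text):
        return text[index]
    return ""


print(last_char("a b c"))        # c
print(char_at("abcd", 3))        # d
print(char_at("abcd", 4) == "")  # True
```

Returning an empty string for a bad index is only one possible design choice; letting the IndexError happen is often the better way to discover a bug.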
int
Expressions in Subscript Notation

It was mentioned in passing above, but it must be addressed with emphasis now: the index is an int
expression. Beyond simple int
literal values, you can form any expression you’d like as long as it evaluates to an int
. This is a SUPER POWER!
This means you can subscript with int
variables, which is very common, but also int
arithmetic and function calls to functions that return an int
. Let’s focus on variables first with an example:
>>> s = "asdf"
>>> i = 0
>>> print(s[i])
a
>>> i += 1
>>> print(s[i])
s
>>> i += 1
>>> print(s[i])
d
>>> i += 1
>>> print(s[i])
f
>>> i += 1
>>> print(s[i])
IndexError: string index out of range
Notice, since i
is an int
variable we can access it from within the subscript notation of s[i]
. Since we can reassign new values to i
while the program is running, that means we can control which index the expression s[i]
refers to at runtime.
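Arithmetic and function calls work inside the square brackets in the same way. The following is a minimal sketch; the variable names middle, following, and last are made up for this illustration.

```python
s = "asdf"
i = 1

# A function call combined with arithmetic: len(s) // 2 evaluates to 2.
middle = s[len(s) // 2]

# Arithmetic on an int variable: i + 1 evaluates to 2.
following = s[i + 1]

# The last index is always len(s) - 1, which is 3 here.
last = s[len(s) - 1]

print(middle)     # d
print(following)  # d
print(last)       # f
```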
Also notice the same two statements are repeated and the effect was to move index-by-index through the characters of the string. When you find yourself writing repeating statements it is a strong indicator you can use a loop to achieve your goal. As you know, though, loops have a test condition that ends the looping behavior. To loop over all items in a str, you can use its len
as part of the loop condition.
>>> s = "hiya!"
>>> i = 0
>>> while i < len(s):
...     print(s[i])
...     i += 1
...
h
i
y
a
!
With this capability, you can write algorithms that are guided by the individual characters a string is made of. For example, when a website checks your password to be sure it meets some criteria, such as a minimum length and containing certain special characters, numbers, and letters… you could now write that function!
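Here is one way such a password check might look. It is only a sketch under assumptions: the function name is_acceptable, the default minimum length of 8, and the rule that any character that is neither a letter nor a digit counts as "special" are all invented for this example. It relies on the built-in str methods isalpha and isdigit.

```python
def is_acceptable(password: str, min_length: int = 8) -> bool:
    """Check length, then look at each character one index at a time."""
    if len(password) < min_length:
        return False

    has_letter = False
    has_digit = False
    has_special = False

    i = 0
    while i < len(password):
        ch = password[i]
        if ch.isalpha():
            has_letter = True
        elif ch.isdigit():
            has_digit = True
        else:
            # Assumption: anything that is not a letter or digit is "special".
            has_special = True
        i += 1

    return has_letter and has_digit and has_special


print(is_acceptable("abc123!?"))  # True
print(is_acceptable("abcdefgh"))  # False
```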
These same concepts of working with items in a sequence using subscript notation are applicable far beyond just string data. In the near future we will learn about lists which give you the ability to form sequences of items of any type not just characters. The same syntax, concepts, and general patterns for thinking apply there, too.
If a str
is a sequence of characters, what are characters?
Like an onion, every abstraction has many layers to it. If you cut too many layers into the onion you leave the realm of computer science and enter the fields of physics and philosophy. You might also cry involuntarily. So let’s peel back only one more layer.
What a character is depends on how it is represented. On your screen, most characters present visually as a shape or space (but many do not!). A single character of data can be represented in many different visual forms by changing fonts and you can accept that it’s still the same, single character of data just viewed in a different way. Icon fonts such as wingdings intentionally present characters, even letters, as icons very different from their traditional form. Changing the font back and forth between an icon font and a traditional one doesn’t change the underlying character data, only how we see it and interpret it.
Inside the computer, characters are represented as coded patterns of 0s and 1s. There is a generally agreed upon “encoding” of character data in computing systems called the American Standard Code for Information Interchange, abbreviated ASCII, that was designed and decided upon in the 1960s. For example, the character A
has the binary code 01000001
. Numerical data, such as int
values, can also be represented in a binary system. It so happens that the same binary pattern 01000001
can be interpreted as the int
value 65
. Since binary is outside our concern, the main takeaway here is every character has a corresponding int
value.
ord
function to find a character’s int
code

Python’s built-in function ord
, short for “ordinal” (a bit of a historical homage), takes a single-character string as an input parameter and returns the int
representation of the character’s binary code.
>>> ord("A")
65
>>> ord("B")
66
>>> ord("Z")
90
>>> ord("a")
97
>>> ord("b")
98
>>> ord("z")
122
There are a few important observations to make in the example above.
First, notice that the codes for letters relative to one another are logical. In the English alphabet, A is followed directly by B, just as in integers 1 is followed by 2. Similarly, A’s integer ASCII code is 65 and B’s is 66. The specific numbers of either are not important, but their relationship to one another is.
Second, because every character has an integer code, characters can be compared with relational operators such as < and >:
>>> "A" < "B"
True
>>> "z" > "a"
True
>>> "a" < "Z"
False
The first two examples are reasonable, but the last might come as a surprise! How is “a” not less than “Z”? Compare their ord
representations and you can learn why: every uppercase letter’s code is lower than every lowercase letter’s code. For case-insensitive comparisons between any two characters, or strings, some additional work is needed.
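One common form of that additional work is to convert both sides to the same case before comparing. The sketch below uses the built-in str method lower; the function name less_than_ignoring_case is hypothetical.

```python
# The codes explain the surprising comparison: 97 is not less than 90.
print(ord("a"))  # 97
print(ord("Z"))  # 90


def less_than_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after converting both to lowercase."""
    return a.lower() < b.lower()


print("a" < "Z")                          # False
print(less_than_ignoring_case("a", "Z"))  # True
```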
chr
function to convert an int
to a character

Since characters and their integer codes are two sides of the same coin, you can freely go back and forth:
>>> chr(65)
'A'
>>> chr(122)
'z'
>>> chr(ord('A'))
'A'
>>> ord(chr(65))
65
The chr
function is built into Python, takes an int
parameter, and returns the single character representation as a string.
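Because ord and chr are inverses, you can do arithmetic on a character’s code and convert the result back into a character. As an illustrative sketch, the hypothetical function to_upper below uppercases a single ASCII lowercase letter by shifting its code; it assumes its argument is a single character and leaves every other character unchanged.

```python
def to_upper(ch: str) -> str:
    """Uppercase one ASCII lowercase letter; return anything else as is."""
    code = ord(ch)
    if ord("a") <= code <= ord("z"):
        # Lowercase and uppercase codes are a fixed distance apart (97 - 65).
        return chr(code - (ord("a") - ord("A")))
    return ch


print(to_upper("a"))  # A
print(to_upper("z"))  # Z
print(to_upper("!"))  # !
```

In everyday code the built-in str method upper does this job; the sketch only shows the idea underneath.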
When ASCII was decided in the 1960s, it was a big achievement to include both lower and uppercase letters in the standard. Emoji, and much more importantly large alphabet languages such as Chinese, were not possible until later. As additional characters were added to international standards, the set of total characters possible expanded well beyond ASCII’s initial 128-character specification (codes 0 through 127).
For example, try the following in the REPL:
>>> chr(129312)
Hold on to your saddles, because we’re about to go on a little adventure. This is a bit outside of the scope of your concerns right now, but in order to make use of Emoji in our programs (which is of utmost importance) there’s just a little more to the story to reveal.
hex
on large integers

Just as the integer 90
can be interpreted in a binary system as 01011010
, it can also be represented in a hexadecimal system as 5A
. It is beyond your concern to navigate numbering systems, but the one you know and love is base 10, meaning it has 10 digits ranging from 0-9. (Notice the 0 indexing you grew comfortable with in elementary school!) Binary is base 2 and has only 2 digits: 0 and 1. Hexadecimal is base 16 and has 16 digits, 0-9 followed by A-F, which correspond to the decimal values of 10-15. Computer scientists love hexadecimal because each digit corresponds to four binary digits. Notice that in the example: 01011010
, which is 8 binary digits, is equivalent to 5A
.
Python has a built-in hex
function for converting an int to its hexadecimal representation. The 0x
prefix in front of the hexadecimal digits can be ignored, and hexadecimal digits are case insensitive.
>>> hex(90)
'0x5a'
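To see the three number systems side by side, here is a short sketch using only built-ins: hex, format with a binary format specifier, and int with an explicit base. The variable names are made up for this example.

```python
code = 90

as_hex = hex(code)               # '0x5a'
as_binary = format(code, "08b")  # '01011010', padded to 8 binary digits

# int can read a string of digits in a given base.
from_hex = int("5A", 16)
from_binary = int("01011010", 2)

print(as_hex)       # 0x5a
print(as_binary)    # 01011010
print(from_hex)     # 90
print(from_binary)  # 90
print(0x5A == 90)   # True, hex literals can be written directly in code
```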
When looking up the codes for emoji or characters in other languages, they will tend to be presented to you in a hex format, such as on this site. You will notice in the code column, there is a format of U+1F920
. The U
tells you this is Unicode, an international coding standard, and the 1F920
is a hexadecimal representation of the code for the cowboy emoji. In Python, you can use such a Unicode character in your strings as follows:
>>> print("The \U0001F920 rides a \U0001F40E!")
The 🤠 rides a 🐎!
The leading backslash begins an escape sequence, which will be discussed in depth shortly. The U
is an indication that what follows is an 8-digit hex representation of a Unicode character. Then, to encode 1F920
, we must add three leading 0
s for padding because 8 digits are expected.
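The relationship between the escape sequence, the integer code, and the hex digits can be checked with the built-ins you have already seen. This is a small sketch; the variable names cowboy, code, and padded are made up for illustration.

```python
cowboy = "\U0001F920"

# ord gives the integer code; hex shows it in base 16.
code = ord(cowboy)
print(code)       # 129312
print(hex(code))  # 0x1f920

# Padding the hex digits to 8 places gives the digits used after \U.
padded = format(code, "08X")
print(padded)     # 0001F920

# chr goes the other way, from the code back to the character.
print(chr(0x1F920) == cowboy)  # True
```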
It is worth taking a moment to appreciate that Python is doing a proper job of treating those emoji each as an individual item in our sequence of characters.
>>> emoji: str = "\U0001F920\U0001F40E"
>>> print(emoji)
🤠🐎
>>> len(emoji)
2
>>> emoji[0]
'🤠'
Notice in the previous example the backslashes in the string "\U0001F920\U0001F40E"
are signalling something special is about to occur. In this case, what follows is a U
which indicates that “8 hexadecimal digits encoding a single Unicode character” follow.
There are other escape sequences, as well. Here are some common ones:
Escape Sequence | Meaning |
---|---|
\" | Double quote (") |
\' | Single quote (') |
\t | Tab |
\n | New Line |
\Uxxxxxxxx | 32-bit Unicode character |
\\ | Backslash (\) |
Notice these escape sequences all begin with a backslash. The most interesting one is the first. If you write a string surrounded in double quote characters "
, how can you use a double quote character in a string? Escaping to the rescue! That sequence, when evaluated, gets interpreted as a single double quote character in the resulting string value.
The last entry in the table is also interesting. If backslashes are how we begin writing an escape sequence in a string literal in Python… how do we write a single backslash? Well, by writing two backslashes back-to-back, of course. The first backslash begins an escape sequence, the second backslash causes the sequence to evaluate to a single backslash.
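A quick way to convince yourself that each escape sequence evaluates to a single character is to check lengths with len. The following sketch uses made-up variable names.

```python
quoted = "She said \"hi\""
print(quoted)          # She said "hi"
print(len(quoted))     # 13, each \" counts as one character

backslash = "\\"
print(backslash)       # \
print(len(backslash))  # 1

tabbed = "a\tb"
print(len(tabbed))     # 3, the tab is one character

two_lines = "one\ntwo"
print(len(two_lines))  # 7, the new line is one character
```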
Zooming back out to thinking of strings at a high-level, by now you have used concatenation enough to recognize that concatenating strings can be a lot of work! Especially if you are concatenating non-string values in the middle of a larger string. Modern Python has a special kind of string literal called a format string or f-string for short, that makes this much easier. Consider the following examples:
>>> course: int = 110
>>> print("I am in COMP" + str(course) + " right now!")
I am in COMP110 right now!
>>> print(f"I am in COMP{ course } right now!")
I am in COMP110 right now!
A key distinction between a regular string and an f-string is that an f-string begins with the letter f
preceding its quotes. Notice the difference between f"Hi"
and "Hi"
, where the former is an f-string.
Inside of an f-string you can write an expression inside of curly braces and it will get substituted with the expression’s value when the string literal is evaluated. Spaces around the expression inside of the curly braces are ignored. This is especially handy if you are building a string with multiple variables being concatenated together. Consider the difference between:
>>> name: str = "Lauren"
>>> age_turning: int = 21
>>> print("Hello " + name + ", you're almost " + str(age_turning) + "!")
Hello Lauren, you're almost 21!
>>> print(f"Hello {name}, you're almost {age_turning}!")
Hello Lauren, you're almost 21!
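The curly braces accept any expression, not only variable names, so subscript notation and function calls such as len work there as well. A brief sketch, with the variable names summary and same invented for this example:

```python
name = "Lauren"

# A subscript expression and a function call inside one f-string.
summary = f"{name[0]} is the first of {len(name)} letters in {name}."
print(summary)  # L is the first of 6 letters in Lauren.

# Spaces around the expression inside the braces do not change the result.
same = f"{ name }" == f"{name}"
print(same)  # True
```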
There are other powerful things format strings can do, too, but they are outside the scope of this course. If you’d like to learn more, this guide covers many useful cases. Other modern programming languages are adopting variations of this same concept, which you may also hear referred to as string interpolation.