Text

Objectives

  • [ ] String indexing and immutability

  • [ ] The string module

  • [ ] Manipulating text strings

  • [ ] Formatting text strings

  • [ ] Text I/O

  • [ ] Working with Unicode

  • [ ] Text to arrays

Strings

  • Point index

  • Interval index

  • Negative index

  • Stride

  • Reversing a string

  • Strings are immutable


[1]:
s = "hello world"
[2]:
s[0], s[6]
[2]:
('h', 'w')
[3]:
s[0:6]
[3]:
'hello '
[4]:
s[-1], s[-3]
[4]:
('d', 'r')
[5]:
s[::2]
[5]:
'hlowrd'
[6]:
s[::-1]
[6]:
'dlrow olleh'
[7]:
try:
    s[0] = 'H'
except TypeError as e:
    print(e)
'str' object does not support item assignment
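
Because strings are immutable, any "change" has to build a new string. A minimal sketch of two common ways to do that, reusing the same value of s as above (the names capitalized and replaced are illustrative only):

```python
s = "hello world"

# Slicing plus concatenation builds a new string; s itself is unchanged.
capitalized = 'H' + s[1:]

# str.replace also returns a new string; the third argument limits
# the replacement to the first match.
replaced = s.replace('h', 'H', 1)

print(capitalized)  # Hello world
print(replaced)     # Hello world
print(s)            # hello world
```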

The string module

  • String constants

  • String capwords

[8]:
import string
[9]:
string.digits
[9]:
'0123456789'
[10]:
string.ascii_letters
[10]:
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
[11]:
string.punctuation
[11]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
[12]:
string.whitespace
[12]:
' \t\n\r\x0b\x0c'
[13]:
string.printable
[13]:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Orphan function in the string module

[14]:
string.capwords(s)
[14]:
'Hello World'

String methods

Methods to change case

[15]:
s.upper()
[15]:
'HELLO WORLD'
[16]:
s.lower()
[16]:
'hello world'
[17]:
'ß'.casefold()
[17]:
'ss'
[18]:
s.capitalize()
[18]:
'Hello world'
[19]:
s.title()
[19]:
'Hello World'

Difference between title method and capwords function

[20]:
'hello:world'.title()
[20]:
'Hello:World'
[21]:
string.capwords('hello:world')
[21]:
'Hello:world'
[22]:
string.capwords('hello:world', sep=':')
[22]:
'Hello:World'
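
The difference also shows up with apostrophes: title treats every non-letter as a word boundary, while capwords splits only on whitespace by default. A small sketch with a made-up sentence:

```python
import string

text = "they're bill's friends"

# title() capitalizes the letter after each apostrophe as well.
titled = text.title()

# capwords() splits on whitespace and capitalizes each piece.
capped = string.capwords(text)

print(titled)  # They'Re Bill'S Friends
print(capped)  # They're Bill's Friends
```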

String predicates

[23]:
s.isalnum()
[23]:
False
[24]:
s.isalpha()
[24]:
False
[25]:
s.isnumeric()
[25]:
False
[26]:
s.isidentifier()
[26]:
False
[27]:
s.isprintable()
[27]:
True
[28]:
s.startswith('hell')
[28]:
True
[29]:
s.endswith('ld')
[29]:
True

Searching and counting

[30]:
'llo' in s
[30]:
True
[31]:
'foo' in s
[31]:
False
[32]:
s.find('llo')
[32]:
2
[33]:
s.index('llo')
[33]:
2
[34]:
s.find('foo')
[34]:
-1
[35]:
try:
    s.index('foo')
except ValueError as e:
    print(e)
substring not found
[36]:
s.count('l')
[36]:
3
[37]:
s.count('ll')
[37]:
1

Stripping

[38]:
'   hello world   '.strip()
[38]:
'hello world'
[39]:
'   hello world   '.lstrip()
[39]:
'hello world   '
[40]:
'   hello world   '.rstrip()
[40]:
'   hello world'

Splitting and joining

[41]:
s.split()
[41]:
['hello', 'world']
[42]:
s.split('l')
[42]:
['he', '', 'o wor', 'd']
[43]:
'-'.join(s)
[43]:
'h-e-l-l-o- -w-o-r-l-d'
[44]:
'-'.join(s.split())
[44]:
'hello-world'
[45]:
'l'.join(s.split('l'))
[45]:
'hello world'

Translation

[46]:
'GATTACA'.translate(str.maketrans('ACTG', 'TAGC'))
[46]:
'CTGGTAT'
[47]:
'GATTACA'.translate(str.maketrans('', '', 'AC'))
[47]:
'GTT'
[48]:
'GATTACA'.translate(str.maketrans(string.ascii_uppercase, string.ascii_lowercase))
[48]:
'gattaca'

ord and chr

[49]:
ord('A'), ord('a')
[49]:
(65, 97)
[50]:
chr(65), chr(97)
[50]:
('A', 'a')
[51]:
chr(ord('B') + (ord('a') - ord('A')))
[51]:
'b'

Formatting strings

C style formatting

[52]:
pi = 3.141592653589793
r = 2
[53]:
'area = %f * %d^2' % (pi, r)
[53]:
'area = 3.141593 * 2^2'

Precision and padding

[54]:
'area = %8.2f * %03d^2' % (pi, r)
[54]:
'area =     3.14 * 002^2'

Right align string

[55]:
'%10s = %8.2f * %03d^2' % ('area', pi, r)
[55]:
'      area =     3.14 * 002^2'

Left align string

[56]:
'%-10s = %8.2f * %03d^2' % ('area', pi, r)
[56]:
'area       =     3.14 * 002^2'

Using the format method

"{" [field_name] ["!" conversion] [":" format_spec] "}"
[57]:
'{big:,}'.format(big=int(1e9))
[57]:
'1,000,000,000'
[58]:
'{pct:.1%}'.format(pct=0.5)
[58]:
'50.0%'
[59]:
'area = {} * {}^2'.format(pi, r)
[59]:
'area = 3.141592653589793 * 2^2'
[60]:
'area = {a} * {b}^2'.format(a=pi, b=r)
[60]:
'area = 3.141592653589793 * 2^2'
[61]:
'area = {pi:8,.4} * {r:06d}^2'.format(pi=pi, r=r)
[61]:
'area =    3.142 * 000002^2'
[62]:
'{:>10}'.format('area')
[62]:
'      area'
[63]:
'{:<10}'.format('area')
[63]:
'area      '
[64]:
'{:^10}'.format('area')
[64]:
'   area   '
[65]:
'{:=^10}'.format('area')
[65]:
'===area==='
[66]:
import datetime

now = datetime.datetime.now()
'{:%a, %d %b %Y: %H:%M %p}'.format(now)
[66]:
'Mon, 13 Jan 2020: 10:41 AM'
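
The examples above exercise field_name and format_spec but not the conversion part of the grammar. The sketch below, which assumes the same pi value as earlier, shows all three pieces; the variable names are illustrative only:

```python
name = 'area'
pi = 3.141592653589793

# field_name may be a positional index, so one argument can be reused.
reused = '{0} and {0} again'.format(name)

# "!r" applies repr() before formatting; "!s" applies str().
converted = '{!r} vs {!s}'.format(name, name)

# format_spec: width 10, 3 digits after the decimal point, fixed-point.
spec = '[{:10.3f}]'.format(pi)

print(reused)     # area and area again
print(converted)  # 'area' vs area
print(spec)       # [     3.142]
```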

Using f strings

[67]:
f'area = {pi} * {r}^2'
[67]:
'area = 3.141592653589793 * 2^2'
[68]:
x = 'area'
f'{x:=^10}'
[68]:
'===area==='
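
f strings accept arbitrary expressions inside the braces, and the format spec can itself contain nested fields. A short sketch, assuming the same pi and r as above (width and precision are illustrative names):

```python
pi = 3.141592653589793
r = 2

# An expression followed by a format spec.
area = f'area = {pi * r**2:.2f}'

# Width and precision taken from variables via nested fields.
width, precision = 10, 3
padded = f'[{pi:{width}.{precision}f}]'

print(area)    # area = 12.57
print(padded)  # [     3.142]
```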

Templates

[69]:
from string import Template
[70]:
t = Template("$who likes $what")
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print(t.substitute(who=name, what=lang))
ann likes Python
bob likes R
cody likes C++
[71]:
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print("{} likes {}".format(name, lang))
ann likes Python
bob likes R
cody likes C++
[72]:
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print(f"{name} likes {lang}")
ann likes Python
bob likes R
cody likes C++

Encodings

  1. There ain’t no such thing as plain text

  2. Text is composed of

    1. Letters (Platonic ideal)

    2. Code points (an integer)

    3. Encodings (how the integer is written in memory)

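  3. Python 3 strings are sequences of Unicode code points, and UTF-8 is the default encoding for source files and for the encode and decode methods (Unicode code point escapes in Python look like \uxxxx, where each x is a hexadecimal digit)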

  4. To see the bytes from an encoding, use the encode method

  5. To see the letter from bytes, use the decode method

  6. You can specify the encoding as an optional argument in the open function

  7. You can use Unicode in variable names
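
The three layers in the list above (letter, code point, encoding) can be sketched for a single character. The hex method is used here only to make the bytes easy to read:

```python
letter = '猫'

# Code point: the integer that Unicode assigns to the letter.
code_point = ord(letter)
print(code_point, hex(code_point))  # 29483 0x732b

# Encodings: the same code point becomes different bytes.
utf8_hex = letter.encode('utf-8').hex()
utf16_hex = letter.encode('utf-16-be').hex()
print(utf8_hex)   # e78cab
print(utf16_hex)  # 732b

# Decoding with the matching encoding recovers the letter.
round_trip = bytes.fromhex(utf8_hex).decode('utf-8')
print(round_trip == letter)  # True
```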

Unicode strings

[73]:
print('hello \u732b')
hello 猫
[74]:
s = '猫'
print(f'hello {s}')
hello 猫


Byte strings

[75]:
kitty = '小' + '猫'
[76]:
print(f'hello {kitty}')
hello 小猫
[77]:
kitty_bytes = kitty.encode('utf8')
kitty_bytes
[77]:
b'\xe5\xb0\x8f\xe7\x8c\xab'
[78]:
kitty_bytes.decode('utf8')
[78]:
'小猫'
[79]:
try:
    kitty_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(e)
'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
[80]:
for suit in '\u2660 \u2665  \u2666  \u2663  \u2664  \u2661  \u2662  \u2667'.split():
    print(suit, end=',')
♠,♥,♦,♣,♤,♡,♢,♧,

Unicode variable names

[81]:
αβγδϵ = 23
[82]:
ΑΒΓΔΕ = 42
[83]:
(ŷ, ÿ, , ȳ, , , , ) = range(8)
[84]:
α⃗, α⃖, α⃡, α⃐, α⃑ = range(5)
[85]:
 = 'real'

Reading and writing text files

[86]:
%%file haiku.txt
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと
Overwriting haiku.txt

Context manager

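A context manager is an object that runs setup code on entry to a with block and cleanup code on exit. It can be written as a class with __enter__ and __exit__ methods, or, most easily, as a generator function wrapped with the standard library decorator contextlib.contextmanager.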

[87]:
class CMDemo:
    """Demo of context manager."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        print("Entering %s" % self.name)

    def __exit__(self, *args):
        print("Exiting %s" % self.name)
[88]:
with CMDemo('foo'):
    print('foo')
Entering foo
foo
Exiting foo
[89]:
from contextlib import contextmanager

@contextmanager
def tag(name):
    print("<%s>" % name)
    yield
    print("</%s>" % name)
[90]:
with tag('foo'):
    print('Hello')
<foo>
Hello
</foo>
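
One caveat with the generator form: if the body of the with block raises, the code after yield is skipped unless the yield is wrapped in try/finally. A minimal sketch, where safe_tag is a hypothetical variant of tag that records its output in a list so the behavior is easy to inspect:

```python
from contextlib import contextmanager

@contextmanager
def safe_tag(name, log):
    """Hypothetical variant of tag that always emits the closing tag."""
    log.append("<%s>" % name)
    try:
        yield
    finally:
        # Runs even if the body of the with block raises.
        log.append("</%s>" % name)

log = []
try:
    with safe_tag('foo', log):
        raise ValueError('oops')
except ValueError:
    pass

print(log)  # ['<foo>', '</foo>']
```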

It is good practice to use open as a context manager for file I/O so we don’t forget to close it. The old practice looks something like

f = open('foo.txt')
# do stuff with f over many lines
f.close()

The trouble is that people forget to close f. The operating system provides only a finite number of file handles, and a program can crash if that number is exceeded.
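
A small sketch of the with form, passing the encoding explicitly as the optional argument to open. The temporary path and the file name haiku_demo.txt are assumptions made so the example is self-contained:

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'haiku_demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    f.write('古池や\n')

with open(path, encoding='utf-8') as f:
    text = f.read()

# The file is closed as soon as the with block ends.
print(f.closed)           # True
print(text == '古池や\n')  # True
```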

[91]:
with open('haiku.txt') as f:
    for line in f:
        print(line, end='')
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと
[92]:
with open('haiku.txt') as f:
    haiku = f.read()
[93]:
haiku
[93]:
'古池や蛙飛び込む水の音\nふるいけやかわずとびこむみずのおと\n'
[94]:
haiku.split()
[94]:
['古池や蛙飛び込む水の音', 'ふるいけやかわずとびこむみずのおと']
[95]:
with open('haiku_alt.txt', 'w') as f:
    f.write(haiku)
[96]:
! cat haiku_alt.txt
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと

Using regular expressions


[97]:
import re

Matching Characters

[98]:
beer = '''99 bottles of Beer on the wall, 99 bottles of beeR.
Take one down and pass it around, 98 bottles of beer on the wall.'''
[99]:
re.findall('beer', beer)
[99]:
['beer']
[100]:
re.findall('beer', beer, re.IGNORECASE)
[100]:
['Beer', 'beeR', 'beer']
[101]:
re.findall('on', beer)
[101]:
['on', 'on', 'on']

Alternatives

[102]:
re.findall('bottles|beer', beer, re.IGNORECASE)
[102]:
['bottles', 'Beer', 'bottles', 'beeR', 'bottles', 'beer']

Word boundaries

[103]:
re.findall(r'\bon\b', beer)
[103]:
['on', 'on']
[104]:
re.findall(r'.', beer)[-10:]
[104]:
[' ', 't', 'h', 'e', ' ', 'w', 'a', 'l', 'l', '.']

Character sets

[105]:
re.findall(r'\d', beer)
[105]:
['9', '9', '9', '9', '9', '8']
[106]:
re.findall(r'[0-9]', beer)
[106]:
['9', '9', '9', '9', '9', '8']
[107]:
re.findall(r'\w', beer)[11:25]
[107]:
['B', 'e', 'e', 'r', 'o', 'n', 't', 'h', 'e', 'w', 'a', 'l', 'l', '9']

Repeating Things

[108]:
re.findall(r'\d+', beer)
[108]:
['99', '99', '98']
[109]:
re.findall(r'b.+r', beer)
[109]:
['bottles of Beer', 'bottles of beer']
[110]:
re.findall(r'be+', beer)
[110]:
['bee', 'bee']
[111]:
re.findall(r'be*', beer)
[111]:
['b', 'b', 'bee', 'b', 'bee']
[112]:
re.findall(r'b[aeiou]+', beer)
[112]:
['bo', 'bo', 'bee', 'bo', 'bee']
[113]:
re.findall(r'b[aeiou]{2,}', beer)
[113]:
['bee', 'bee']
[114]:
re.findall(r'b[aeiou]{1}', beer)
[114]:
['bo', 'bo', 'be', 'bo', 'be']

Finding matches

[115]:
for m in re.finditer('beer', beer, re.IGNORECASE):
    print(m.start(), m.end(), m.span(),  m.group())
14 18 (14, 18) Beer
46 50 (46, 50) beeR
100 104 (100, 104) beer

Grouping

[116]:
re.findall(r'(\d+)\s+(\b\w+?\b)', beer, re.IGNORECASE)
[116]:
[('99', 'bottles'), ('99', 'bottles'), ('98', 'bottles')]

Splitting

[117]:
re.split(r'\d+', beer)
[117]:
['',
 ' bottles of Beer on the wall, ',
 ' bottles of beeR.\nTake one down and pass it around, ',
 ' bottles of beer on the wall.']

Search and replace

[118]:
print(re.sub('beer', 'whiskey', beer, flags=re.IGNORECASE))
99 bottles of whiskey on the wall, 99 bottles of whiskey.
Take one down and pass it around, 98 bottles of whiskey on the wall.
[119]:
print(re.sub(r'(\d+)\s+(\b\w+?\b)', r'\2 \1', beer, flags=re.IGNORECASE))
bottles 99 of Beer on the wall, bottles 99 of beeR.
Take one down and pass it around, bottles 98 of beer on the wall.

Function versus compiled method

[120]:
pattern = re.compile(r'(\d+)\s+(\b\w+?\b)')
pattern.findall(beer)
[120]:
[('99', 'bottles'), ('99', 'bottles'), ('98', 'bottles')]

Raw strings

The backslash \ is an escape character in an ordinary Python string literal, so it must be escaped (\\) to produce a literal \. The backslash is also an escape character in the regular expression mini-language, so it must be escaped again for the pattern compiler. Escaping at both levels means we need \\\\ to match a single literal \. A raw string such as r'foo' treats \ as a literal character rather than an escape character, so only the regular expression level of escaping remains.

[121]:
latex = 'latex uses \section over and over again like so \section'
[122]:
re.findall('\section', latex)
[122]:
[]
[123]:
re.findall('\\section', latex)
[123]:
[]
[124]:
re.findall('\\\\section', latex)
[124]:
['\\section', '\\section']
[125]:
re.findall(r'\\section', latex)
[125]:
['\\section', '\\section']

Examples

Removing punctuation

[126]:
ss = 'What the #$@&%*! does your code mean?'

Using a comprehension

[127]:
''.join(ch for ch in ss if ch not in string.punctuation)
[127]:
'What the  does your code mean'

Using a built-in function

[128]:
ss.translate(str.maketrans('','', string.punctuation))
[128]:
'What the  does your code mean'

Using a regular expression

[129]:
pat = re.compile('[%s]' % re.escape(string.punctuation))
[130]:
pat.sub('', ss)
[130]:
'What the  does your code mean'

Timing

[131]:
%timeit ''.join(ch for ch in ss if ch not in string.punctuation)
6.65 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
[132]:
%timeit ss.translate(str.maketrans('','', string.punctuation))
4.25 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
[133]:
%timeit pat.sub('', ss)
2.17 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
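
Note that the translate timing above includes building the table with str.maketrans on every call, while the regular expression was compiled once beforehand. A sketch of a more even comparison, assuming the table is built once and reused (punct_table is an illustrative name; timings vary by machine, so none are shown):

```python
import string
import timeit

ss = 'What the #$@&%*! does your code mean?'

# Build the translation table once, outside the timed statement.
punct_table = str.maketrans('', '', string.punctuation)
cleaned = ss.translate(punct_table)
print(cleaned)  # What the  does your code mean

# timeit.timeit returns the total seconds taken by `number` calls.
n_calls = 10_000
elapsed = timeit.timeit(lambda: ss.translate(punct_table), number=n_calls)
print(f'{elapsed / n_calls * 1e6:.2f} µs per call')
```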

Custom version of capwords

[134]:
string.capwords('hello    world')
[134]:
'Hello World'
[135]:
def my_capwords(ss):
    # capitalize (not title) matches string.capwords, e.g. for 'hello:world'
    return ' '.join([s.capitalize() for s in ss.split()])
[136]:
my_capwords('hello    world')
[136]:
'Hello World'

Bag of words

Create a table of counts, where rows represent unique words and columns represent different documents. Ignore case and punctuation.

[137]:
doc1 = """The wheels on the bus go,
Round and round,
Round and round,
Round and round.
The wheels on the bus go
Round and round,
All through the town."""

doc2 = """The doors on the bus go,
Open and shut,
Open and shut,
Open and shut.
The doors on the bus go
Open and shut,
All through the town."""

doc3 = """The Driver on the bus says,
"Move on back!
Move on back!
Move on back!"
The Driver on the bus says,
"Move on back!"
All through the town."""

doc4 = """The babies on the bus go,
"Wah, wah, wah!
Wah, wah, wah!
Wah, wah, wah!"
The babies on the bus go,
"Wah, wah, wah!"
All through the town."""
[138]:
docs = [doc1, doc2, doc3, doc4]
doc_words = [doc.strip().lower().translate(str.maketrans('', '', string.punctuation)).split()
             for doc in docs]
words = [word for words in doc_words for word in words]
vocab = set(words)
[139]:
import numpy as np
import pandas as pd
[140]:
table = np.zeros((len(vocab), len(docs)), dtype='int')
[141]:
for i, word in enumerate(vocab):
    for j, doc in enumerate(doc_words):
        table[i, j] = doc.count(word)
[142]:
pd.DataFrame(table, columns='doc1 doc2 doc3 doc4'.split(), index=vocab)
[142]:
          doc1  doc2  doc3  doc4
town         1     1     1     1
round        8     0     0     0
doors        0     2     0     0
says         0     0     2     0
on           2     2     6     2
and          4     4     0     0
through      1     1     1     1
open         0     4     0     0
move         0     0     4     0
back         0     0     4     0
babies       0     0     0     2
wah          0     0     0    12
bus          2     2     2     2
the          5     5     5     5
driver       0     0     2     0
shut         0     4     0     0
all          1     1     1     1
wheels       2     0     0     0
go           2     2     0     2
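
The same kind of table can also be built with collections.Counter, which avoids calling count once per word and document. This is an alternative sketch, not the approach used above; it runs on two short stand-in documents (hypothetical, not the verses above) and sorts the vocabulary so that the row order is reproducible:

```python
import string
from collections import Counter

# Two short stand-in documents (hypothetical).
docs = ["The bus, the BUS!", "A bus stop."]

# Lowercase, strip punctuation, split into words, then count per document.
table = str.maketrans('', '', string.punctuation)
counts = [Counter(doc.lower().translate(table).split()) for doc in docs]

# Sorted union of all words seen in any document.
vocab = sorted(set().union(*counts))

# One row per word, one column per document; missing words count as 0.
rows = {word: [c[word] for c in counts] for word in vocab}

for word, row in rows.items():
    print(word, row)
```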