Text¶
Objectives¶
[ ] String indexing and immutability
[ ] The string module
[ ] Manipulating text strings
[ ] Formatting text strings
[ ] Text I/O
[ ] Working with Unicode
[ ] Text to arrays
Strings¶
Point index
Interval index
Negative index
Stride
Reversing a string
Strings are immutable
References¶
[1]:
s = "hello world"
[2]:
s[0], s[6]
[2]:
('h', 'w')
[3]:
s[0:6]
[3]:
'hello '
[4]:
s[-1], s[-3]
[4]:
('d', 'r')
[5]:
s[::2]
[5]:
'hlowrd'
[6]:
s[::-1]
[6]:
'dlrow olleh'
[7]:
try:
    s[0] = 'H'
except TypeError as e:
    print(e)
'str' object does not support item assignment
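Because strings are immutable, an "edit" always builds a new string. A minimal sketch of two common ways to do that (the names `capitalized`, `chars` and `rejoined` are illustrative and not part of the original notebook):

```python
s = "hello world"

# Build a new string from a literal and a slice of the old one.
capitalized = 'H' + s[1:]

# Or go through a mutable list of characters and join it back together.
chars = list(s)
chars[0] = 'H'
rejoined = ''.join(chars)

# The original string is untouched in both cases.
print(capitalized, rejoined, s, sep=' | ')
```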
The string module¶
String constants
String capwords
[8]:
import string
[9]:
string.digits
[9]:
'0123456789'
[10]:
string.ascii_letters
[10]:
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
[11]:
string.punctuation
[11]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
[12]:
string.whitespace
[12]:
' \t\n\r\x0b\x0c'
[13]:
string.printable
[13]:
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
String methods¶
Methods to change case¶
[15]:
s.upper()
[15]:
'HELLO WORLD'
[16]:
s.lower()
[16]:
'hello world'
[17]:
'ß'.casefold()
[17]:
'ss'
[18]:
s.capitalize()
[18]:
'Hello world'
[19]:
s.title()
[19]:
'Hello World'
Difference between title method and capwords function
[20]:
'hello:world'.title()
[20]:
'Hello:World'
[21]:
string.capwords('hello:world')
[21]:
'Hello:world'
[22]:
string.capwords('hello:world', sep=':')
[22]:
'Hello:World'
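The same difference shows up with apostrophes: `title` treats any non-letter as a word boundary, while `capwords` splits on whitespace only. A small sketch (the sample sentence is an assumption chosen for illustration):

```python
import string

text = "they're bill's friends"

# title() restarts capitalization after every non-letter, including "'".
titled = text.title()

# capwords() splits on whitespace, capitalizes each piece, and rejoins.
capped = string.capwords(text)

print(titled)
print(capped)
```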
String predicates¶
[23]:
s.isalnum()
[23]:
False
[24]:
s.isalpha()
[24]:
False
[25]:
s.isnumeric()
[25]:
False
[26]:
s.isidentifier()
[26]:
False
[27]:
s.isprintable()
[27]:
True
[28]:
s.startswith('hell')
[28]:
True
[29]:
s.endswith('ld')
[29]:
True
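The numeric predicates are easy to confuse. The sketch below compares `isdecimal`, `isdigit` and `isnumeric` on three characters; the sample characters are assumptions picked to show where the three disagree:

```python
# '7', superscript two, and the vulgar fraction one half
samples = ['7', '\u00b2', '\u00bd']

# Each row is (isdecimal, isdigit, isnumeric) for one character.
rows = [(ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples]

for ch, row in zip(samples, rows):
    print(repr(ch), row)
```

Roughly, `isdecimal` is the strictest test and `isnumeric` the most permissive.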
Searching and counting¶
[30]:
'llo' in s
[30]:
True
[31]:
'foo' in s
[31]:
False
[32]:
s.find('llo')
[32]:
2
[33]:
s.index('llo')
[33]:
2
[34]:
s.find('foo')
[34]:
-1
[35]:
try:
    s.index('foo')
except ValueError as e:
    print(e)
substring not found
[36]:
s.count('l')
[36]:
3
[37]:
s.count('ll')
[37]:
1
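Note that `count` only counts non-overlapping occurrences. If overlapping matches matter, one possible workaround is a regular expression lookahead, sketched here on an illustrative string:

```python
import re

text = 'aaaa'

# str.count scans left to right and never reuses a character.
non_overlapping = text.count('aa')

# A zero-width lookahead matches at every position where 'aa' starts.
overlapping = len(re.findall(r'(?=aa)', text))

print(non_overlapping, overlapping)
```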
Stripping¶
[38]:
' hello world '.strip()
[38]:
'hello world'
[39]:
' hello world '.lstrip()
[39]:
'hello world '
[40]:
' hello world '.rstrip()
[40]:
' hello world'
Splitting and joining¶
[41]:
s.split()
[41]:
['hello', 'world']
[42]:
s.split('l')
[42]:
['he', '', 'o wor', 'd']
[43]:
'-'.join(s)
[43]:
'h-e-l-l-o- -w-o-r-l-d'
[44]:
'-'.join(s.split())
[44]:
'hello-world'
[45]:
'l'.join(s.split('l'))
[45]:
'hello world'
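Calling `split()` with no argument behaves differently from `split(' ')`: the former collapses runs of whitespace, the latter produces an empty string between every pair of adjacent separators. A short sketch with an illustrative string:

```python
# 'a', two spaces, 'b', three spaces, 'c'
messy = 'a' + ' ' * 2 + 'b' + ' ' * 3 + 'c'

# No argument: split on runs of any whitespace.
by_whitespace = messy.split()

# Explicit separator: every single space is a boundary.
by_space = messy.split(' ')

# split() followed by join() is a common way to normalize spacing.
normalized = ' '.join(messy.split())

print(by_whitespace, by_space, normalized, sep='\n')
```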
Translation¶
[46]:
'GATTACA'.translate(str.maketrans('ACTG', 'TAGC'))
[46]:
'CTGGTAT'
[47]:
'GATTACA'.translate(str.maketrans('', '', 'AC'))
[47]:
'GTT'
[48]:
'GATTACA'.translate(str.maketrans(string.ascii_uppercase, string.ascii_lowercase))
[48]:
'gattaca'
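`str.maketrans` also accepts a single dictionary, which can be easier to read than two parallel strings. As an illustration, the sketch below computes a reverse complement, assuming the standard A-T and C-G base pairing (this mapping is an assumption, not the one used in the cells above):

```python
# Dictionary form: each key is a single character, each value its replacement.
complement = str.maketrans({'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'})

seq = 'GATTACA'

# Complement each base, then reverse the result with a stride of -1.
rev_comp = seq.translate(complement)[::-1]

print(rev_comp)
```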
ord and chr¶
[49]:
ord('A'), ord('a')
[49]:
(65, 97)
[50]:
chr(65), chr(97)
[50]:
('A', 'a')
[51]:
chr(ord('B') + (ord('a') - ord('A')))
[51]:
'b'
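The same code point arithmetic is enough for a simple Caesar shift. This is only a sketch for ASCII letters; the function names `shift_letter` and `caesar` are hypothetical:

```python
def shift_letter(ch, k):
    """Shift an ASCII letter by k places, wrapping around the alphabet."""
    if 'a' <= ch <= 'z':
        base = ord('a')
    elif 'A' <= ch <= 'Z':
        base = ord('A')
    else:
        return ch  # leave digits, spaces and punctuation unchanged
    return chr(base + (ord(ch) - base + k) % 26)


def caesar(text, k):
    """Apply shift_letter to every character of text."""
    return ''.join(shift_letter(ch, k) for ch in text)


secret = caesar('Hello, World!', 3)
print(secret, caesar(secret, -3))
```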
Formatting strings¶
C style formatting¶
[52]:
pi = 3.141592653589793
r = 2
[53]:
'area = %f * %d^2' % (pi, r)
[53]:
'area = 3.141593 * 2^2'
Precision and padding
[54]:
'area = %8.2f * %03d^2' % (pi, r)
[54]:
'area =     3.14 * 002^2'
Right align string
[55]:
'%10s = %8.2f * %03d^2' % ('area', pi, r)
[55]:
'      area =     3.14 * 002^2'
Left align string
[56]:
'%-10s = %8.2f * %03d^2' % ('area', pi, r)
[56]:
'area       =     3.14 * 002^2'
Using the format method¶
"{" [field_name] ["!" conversion] [":" format_spec] "}"
[57]:
'{big:,}'.format(big=int(1e9))
[57]:
'1,000,000,000'
[58]:
'{pct:.1%}'.format(pct=0.5)
[58]:
'50.0%'
[59]:
'area = {} * {}^2'.format(pi, r)
[59]:
'area = 3.141592653589793 * 2^2'
[60]:
'area = {a} * {b}^2'.format(a=pi, b=r)
[60]:
'area = 3.141592653589793 * 2^2'
[61]:
'area = {pi:8,.4} * {r:06d}^2'.format(pi=pi, r=r)
[61]:
'area =    3.142 * 000002^2'
[62]:
'{:>10}'.format('area')
[62]:
'      area'
[63]:
'{:<10}'.format('area')
[63]:
'area      '
[64]:
'{:^10}'.format('area')
[64]:
'   area   '
[65]:
'{:=^10}'.format('area')
[65]:
'===area==='
[66]:
import datetime
now = datetime.datetime.now()
'{:%a, %d %b %Y: %H:%M %p}'.format(now)
[66]:
'Mon, 13 Jan 2020: 10:41 AM'
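The three optional parts of a replacement field (field name, `!` conversion, `:` format spec) can be combined. A small sketch, with illustrative values, of how each part changes the result:

```python
value = 'pi'

# Field name only: uses str(value).
plain = '{name}'.format(name=value)

# Conversion !r: uses repr(value), so the quotes are kept.
with_repr = '{name!r}'.format(name=value)

# Conversion plus format spec: repr first, then right-align in 8 columns.
padded = '{name!r:>8}'.format(name=value)

# A field name may also index into a positional argument.
indexed = '{0[1]}-{0[0]}'.format(['a', 'b'])

print(plain, with_repr, padded, indexed, sep='\n')
```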
Using f strings¶
[67]:
f'area = {pi} * {r}^2'
[67]:
'area = 3.141592653589793 * 2^2'
[68]:
x = 'area'
f'{x:=^10}'
[68]:
'===area==='
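f-strings accept arbitrary expressions, and the format spec itself may contain nested fields. A brief sketch (the names `width` and `prec` are illustrative, and the `=` specifier assumes Python 3.8 or later):

```python
pi = 3.141592653589793
r = 2
width, prec = 10, 3

# Any expression can appear before the colon.
area = f'{pi * r**2:.2f}'

# Width and precision can themselves be filled in from variables.
nested = f'{pi:{width}.{prec}f}'

# Python 3.8+: a trailing = echoes the expression text, handy when debugging.
debug = f'{r=}'

print(area, nested, debug, sep='\n')
```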
Templates¶
[69]:
from string import Template
[70]:
t = Template("$who likes $what")
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print(t.substitute(who=name, what=lang))
ann likes Python
bob likes R
cody likes C++
[71]:
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print("{} likes {}".format(name, lang))
ann likes Python
bob likes R
cody likes C++
[72]:
items = [('ann', 'Python'), ('bob', 'R'), ('cody', 'C++')]
for name, lang in items:
    print(f"{name} likes {lang}")
ann likes Python
bob likes R
cody likes C++
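One practical difference from `format` and f-strings is how `Template` handles missing values. The sketch below contrasts `substitute`, which raises `KeyError`, with `safe_substitute`, which leaves unknown placeholders in place:

```python
from string import Template

t = Template("$who likes $what")

# substitute() insists on a value for every placeholder.
try:
    t.substitute(who='ann')
except KeyError as e:
    missing = e.args[0]

# safe_substitute() leaves unresolved placeholders untouched.
partial = t.safe_substitute(who='ann')

# Braces delimit the name when it is followed directly by other text.
braces = Template("${what}ic").substitute(what='Python')

print(missing, partial, braces, sep='\n')
```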
Encodings¶
There ain’t no such thing as plain text
Text is composed of
Letters (Platonic ideal)
Code points (an integer)
Encodings (how the integer is written in memory)
Python 3 strings are sequences of Unicode code points, and UTF-8 is the default encoding for source files (Unicode escapes in Python look like \uxxxx, where each x is a hexadecimal digit)
To see the bytes from an encoding, use the encode method
To see the letters from bytes, use the decode method
You can specify the encoding as an optional argument in the open function
You can use Unicode in variable names
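To make the letter, code point and encoding distinction concrete, the following sketch takes one character through several encodings. The character and the choice of encodings are illustrative assumptions:

```python
letter = '\u00e9'                      # the letter é

# The code point is a single integer, independent of any encoding.
code_point = ord(letter)

# The same code point becomes different bytes under different encodings.
utf8 = letter.encode('utf-8')
latin1 = letter.encode('latin-1')
utf16 = letter.encode('utf-16-le')

# Decoding with the wrong encoding does not always fail; it can silently
# produce the wrong letters instead.
mojibake = utf8.decode('latin-1')

print(code_point, utf8, latin1, utf16, mojibake)
```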
Byte strings¶
[75]:
kitty = '小' + '猫'
[76]:
print(f'hello {kitty}')
hello 小猫
[77]:
kitty_bytes = kitty.encode('utf8')
kitty_bytes
[77]:
b'\xe5\xb0\x8f\xe7\x8c\xab'
[78]:
kitty_bytes.decode('utf8')
[78]:
'小猫'
[79]:
try:
    kitty_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(e)
'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
[80]:
for suit in '\u2660 \u2665 \u2666 \u2663 \u2664 \u2661 \u2662 \u2667'.split():
    print(suit, end=',')
♠,♥,♦,♣,♤,♡,♢,♧,
Unicode variable names¶
[81]:
αβγδϵ = 23
[82]:
ΑΒΓΔΕ = 42
[83]:
(ŷ, ÿ, ỹ, ȳ, y̅, y̆, y̌, y̲) = range(8)
[84]:
α⃗, α⃖, α⃡, α⃐, α⃑ = range(5)
[85]:
ℜ = 'real'
Reading and writing text files¶
[86]:
%%file haiku.txt
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと
Overwriting haiku.txt
Context manager¶
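A context manager is an object that runs setup code on entry to a with block and cleanup code on exit. It can be written as a class with __enter__ and __exit__ methods or, often more easily, as a generator function decorated with contextlib.contextmanager from the standard library.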
[87]:
class CMDemo:
    """Demo of context manager."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        print("Entering %s" % self.name)
    def __exit__(self, *args):
        print("Exiting %s" % self.name)
[88]:
with CMDemo('foo'):
    print('foo')
Entering foo
foo
Exiting foo
[89]:
from contextlib import contextmanager
@contextmanager
def tag(name):
    print("<%s>" % name)
    yield
    print("</%s>" % name)
[90]:
with tag('foo'):
    print('Hello')
<foo>
Hello
</foo>
It is good practice to use open as a context manager for file I/O so we don’t forget to close the file. The old practice looks something like
f = open('foo.txt')
# do stuff with f over many lines
f.close()
The trouble is that people forget to close f, and the operating system provides only a finite number of file handles, so a program can crash if that number is exceeded.
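For comparison, here is a sketch of what a safe manual version has to look like, next to the equivalent `with` form. The temporary file path is an assumption made so the example is self-contained:

```python
import os
import tempfile

# Create a small scratch file to read back (path is illustrative).
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-8') as out:
    out.write('hello\n')

# Manual style: try/finally is needed to guarantee the close.
f = open(path, encoding='utf-8')
try:
    manual_text = f.read()
finally:
    f.close()

# Context manager style: the file is closed when the block exits,
# even if the body raises an exception.
with open(path, encoding='utf-8') as g:
    with_text = g.read()

print(f.closed, g.closed, manual_text == with_text)
```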
[91]:
with open('haiku.txt') as f:
    for line in f:
        print(line, end='')
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと
[92]:
with open('haiku.txt') as f:
    haiku = f.read()
[93]:
haiku
[93]:
'古池や蛙飛び込む水の音\nふるいけやかわずとびこむみずのおと\n'
[94]:
haiku.split()
[94]:
['古池や蛙飛び込む水の音', 'ふるいけやかわずとびこむみずのおと']
[95]:
with open('haiku_alt.txt', 'w') as f:
    f.write(haiku)
[96]:
! cat haiku_alt.txt
古池や蛙飛び込む水の音
ふるいけやかわずとびこむみずのおと
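The cells above rely on the platform's default encoding. Passing `encoding` explicitly to `open` makes the round trip independent of the platform; the sketch below uses a temporary file (an assumption, to avoid touching haiku.txt) and also shows what happens when the wrong encoding is requested:

```python
import os
import tempfile

text = '古池や蛙飛び込む水の音\n'
path = os.path.join(tempfile.mkdtemp(), 'haiku_utf8.txt')

# Write and read with the same explicit encoding.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with open(path, encoding='utf-8') as f:
    round_trip = f.read()

# Reading the same bytes as ASCII fails, since none of them are below 128.
try:
    with open(path, encoding='ascii') as f:
        f.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(round_trip == text, ascii_failed)
```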
Using regular expressions¶
Practice at https://regex101.com
Play RegEx Golf
[97]:
import re
Matching Characters¶
[98]:
beer = '''99 bottles of Beer on the wall, 99 bottles of beeR.
Take one down and pass it around, 98 bottles of beer on the wall.'''
[99]:
re.findall('beer', beer)
[99]:
['beer']
[100]:
re.findall('beer', beer, re.IGNORECASE)
[100]:
['Beer', 'beeR', 'beer']
[101]:
re.findall('on', beer)
[101]:
['on', 'on', 'on']
Alternatives¶
[102]:
re.findall('bottles|beer', beer, re.IGNORECASE)
[102]:
['bottles', 'Beer', 'bottles', 'beeR', 'bottles', 'beer']
Word boundaries¶
[103]:
re.findall(r'\bon\b', beer)
[103]:
['on', 'on']
[104]:
re.findall(r'.', beer)[-10:]
[104]:
[' ', 't', 'h', 'e', ' ', 'w', 'a', 'l', 'l', '.']
Character sets¶
[105]:
re.findall(r'\d', beer)
[105]:
['9', '9', '9', '9', '9', '8']
[106]:
re.findall(r'[0-9]', beer)
[106]:
['9', '9', '9', '9', '9', '8']
[107]:
re.findall(r'\w', beer)[11:25]
[107]:
['B', 'e', 'e', 'r', 'o', 'n', 't', 'h', 'e', 'w', 'a', 'l', 'l', '9']
Repeating Things¶
[108]:
re.findall(r'\d+', beer)
[108]:
['99', '99', '98']
[109]:
re.findall(r'b.+r', beer)
[109]:
['bottles of Beer', 'bottles of beer']
[110]:
re.findall(r'be+', beer)
[110]:
['bee', 'bee']
[111]:
re.findall(r'be*', beer)
[111]:
['b', 'b', 'bee', 'b', 'bee']
[112]:
re.findall(r'b[aeiou]+', beer)
[112]:
['bo', 'bo', 'bee', 'bo', 'bee']
[113]:
re.findall(r'b[aeiou]{2,}', beer)
[113]:
['bee', 'bee']
[114]:
re.findall(r'b[aeiou]{1}', beer)
[114]:
['bo', 'bo', 'be', 'bo', 'be']
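The quantifiers above are greedy: they match as much as possible. Appending `?` makes them lazy. A minimal sketch on an illustrative string:

```python
import re

markup = '<a><b>'

# Greedy: .+ runs to the end of the string, then backtracks to the last '>'.
greedy = re.findall(r'<.+>', markup)

# Lazy: .+? stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r'<.+?>', markup)

print(greedy, lazy)
```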
Finding matches¶
[115]:
for m in re.finditer('beer', beer, re.IGNORECASE):
    print(m.start(), m.end(), m.span(), m.group())
14 18 (14, 18) Beer
46 50 (46, 50) beeR
100 104 (100, 104) beer
Grouping¶
[116]:
re.findall(r'(\d+)\s+(\b\w+?\b)', beer, re.IGNORECASE)
[116]:
[('99', 'bottles'), ('99', 'bottles'), ('98', 'bottles')]
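Groups can also be named, which tends to read better than positional indices once a pattern has more than a couple of groups. A sketch with the hypothetical group names `count` and `item`:

```python
import re

line = '99 bottles of Beer on the wall'

# (?P<name>...) gives a group a name in addition to its number.
m = re.search(r'(?P<count>\d+)\s+(?P<item>\w+)', line)

count = int(m.group('count'))
item = m.group('item')
as_dict = m.groupdict()

print(count, item, as_dict)
```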
Splitting¶
[117]:
re.split(r'\d+', beer)
[117]:
['',
' bottles of Beer on the wall, ',
' bottles of beeR.\nTake one down and pass it around, ',
' bottles of beer on the wall.']
Search and replace¶
[118]:
print(re.sub('beer', 'whiskey', beer, flags=re.IGNORECASE))
99 bottles of whiskey on the wall, 99 bottles of whiskey.
Take one down and pass it around, 98 bottles of whiskey on the wall.
[119]:
print(re.sub(r'(\d+)\s+(\b\w+?\b)', r'\2 \1', beer, flags=re.IGNORECASE))
bottles 99 of Beer on the wall, bottles 99 of beeR.
Take one down and pass it around, bottles 98 of beer on the wall.
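The replacement passed to `re.sub` can also be a function that receives each match object. The sketch below decrements every number in a line; the helper name `decrement` is hypothetical. Note that `count` and `flags` are passed by keyword, since the fourth positional parameter of `re.sub` is `count`, not `flags`.

```python
import re

line = '99 bottles of beer on the wall, 99 bottles of beer.'


def decrement(match):
    """Return the matched number minus one, as a string."""
    return str(int(match.group()) - 1)


# The function is called once per match.
all_numbers = re.sub(r'\d+', decrement, line)

# count limits how many matches are replaced.
first_only = re.sub(r'\d+', decrement, line, count=1)

print(all_numbers)
print(first_only)
```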
Function versus compiled method¶
[120]:
pattern = re.compile(r'(\d+)\s+(\b\w+?\b)')
pattern.findall(beer)
[120]:
[('99', 'bottles'), ('99', 'bottles'), ('98', 'bottles')]
Raw strings¶
The backslash \ is an escape character in a regular Python string, so we need to escape it (\\) to get a literal \. However, \ is also an escape character in the regular expression mini-language used when compiling the pattern. So we need to escape at two levels, hence we need '\\\\' to match a literal \. A raw string such as r'foo' treats \ as a literal character rather than an escape character, so r'\\' is enough.
[121]:
latex = 'latex uses \section over and over again like so \section'
[122]:
re.findall('\section', latex)
[122]:
[]
[123]:
re.findall('\\section', latex)
[123]:
[]
[124]:
re.findall('\\\\section', latex)
[124]:
['\\section', '\\section']
[125]:
re.findall(r'\\section', latex)
[125]:
['\\section', '\\section']
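A quick way to convince yourself that the last two patterns are identical is to compare the strings directly. The sketch below also uses `re.escape`, which builds the escaped pattern for you; the variable names are illustrative:

```python
import re

regular = '\\\\section'   # four backslashes in source, two in the string
raw = r'\\section'        # two backslashes in source, two in the string

# Both spellings produce the same 9-character pattern.
same = regular == raw
length = len(raw)

# re.escape turns the literal text \section into a pattern that matches it.
escaped = re.escape('\\section')

text = r'latex uses \section over and over again like so \section'
found = re.findall(escaped, text)

print(same, length, escaped, found)
```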
Examples¶
Removing punctuation¶
[126]:
ss = 'What the #$@&%*! does your code mean?'
Using a comprehension
[127]:
''.join(s for s in ss if not s in string.punctuation)
[127]:
'What the  does your code mean'
Using a built-in function
[128]:
ss.translate(str.maketrans('','', string.punctuation))
[128]:
'What the  does your code mean'
Using a regular expression
[129]:
pat = re.compile('[%s]' % re.escape(string.punctuation))
[130]:
pat.sub('', ss)
[130]:
'What the  does your code mean'
Timing¶
[131]:
%timeit ''.join(s for s in ss if not s in string.punctuation)
6.65 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
[132]:
%timeit ss.translate(str.maketrans('','', string.punctuation))
4.25 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
[133]:
%timeit pat.sub('', ss)
2.17 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
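Outside IPython, where the `%timeit` magic is not available, the standard library `timeit` module can run a similar comparison. This is only a sketch: absolute numbers vary by machine, and unlike the cell above it builds the translation table once, outside the timed call (the names `candidates`, `results` and `timings` are illustrative).

```python
import re
import string
import timeit

ss = 'What the #$@&%*! does your code mean?'
pat = re.compile('[%s]' % re.escape(string.punctuation))
table = str.maketrans('', '', string.punctuation)

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    'comprehension': lambda: ''.join(c for c in ss if c not in string.punctuation),
    'translate': lambda: ss.translate(table),
    'regex': lambda: pat.sub('', ss),
}

# Check that all three approaches agree before comparing their speed.
results = {name: fn() for name, fn in candidates.items()}

# Total seconds for 10,000 calls of each candidate.
timings = {name: timeit.timeit(fn, number=10_000) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f'{name:>13}: {seconds:.4f} s')
```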
Custom version of capwords¶
[134]:
string.capwords('hello world')
[134]:
'Hello World'
[135]:
def my_capwords(ss):
    return ' '.join([s.title() for s in ss.split()])
[136]:
my_capwords('hello world')
[136]:
'Hello World'
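Because it relies on `title`, the version above differs from `string.capwords` for words containing apostrophes, and it has no `sep` argument. A variant built on `capitalize` stays closer to the library function; the name `my_capwords2` is hypothetical and this is a sketch rather than the library's exact source:

```python
import string


def my_capwords2(ss, sep=None):
    """Split on sep (whitespace if None), capitalize each word, rejoin."""
    return (sep or ' ').join(word.capitalize() for word in ss.split(sep))


# Compare against the library function on a few illustrative inputs.
cases = [('hello world', None), ("don't stop", None), ('hello:world', ':')]
for text, sep in cases:
    print(my_capwords2(text, sep), string.capwords(text, sep), sep=' | ')
```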
Bag of words¶
Create a table of counts, where rows represent unique words and columns represent different documents. Ignore case and punctuation.
[137]:
doc1 = """The wheels on the bus go,
Round and round,
Round and round,
Round and round.
The wheels on the bus go
Round and round,
All through the town."""
doc2 = """The doors on the bus go,
Open and shut,
Open and shut,
Open and shut.
The doors on the bus go
Open and shut,
All through the town."""
doc3 = """The Driver on the bus says,
"Move on back!
Move on back!
Move on back!"
The Driver on the bus says,
"Move on back!"
All through the town."""
doc4 = """The babies on the bus go,
"Wah, wah, wah!
Wah, wah, wah!
Wah, wah, wah!"
The babies on the bus go,
"Wah, wah, wah!"
All through the town."""
[138]:
docs = [doc1, doc2, doc3, doc4]
doc_words = [doc.strip().lower().translate(str.maketrans('', '', string.punctuation)).split()
             for doc in docs]
words = [word for words in doc_words for word in words]
vocab = set(words)
[139]:
import numpy as np
import pandas as pd
[140]:
table = np.zeros((len(vocab), len(docs)), dtype='int')
[141]:
for i, word in enumerate(vocab):
    for j, doc in enumerate(doc_words):
        table[i, j] = doc.count(word)
[142]:
pd.DataFrame(table, columns='doc1 doc2 doc3 doc4'.split(), index=vocab)
[142]:
| | doc1 | doc2 | doc3 | doc4 |
|---|---|---|---|---|
| town | 1 | 1 | 1 | 1 |
| round | 8 | 0 | 0 | 0 |
| doors | 0 | 2 | 0 | 0 |
| says | 0 | 0 | 2 | 0 |
| on | 2 | 2 | 6 | 2 |
| and | 4 | 4 | 0 | 0 |
| through | 1 | 1 | 1 | 1 |
| open | 0 | 4 | 0 | 0 |
| move | 0 | 0 | 4 | 0 |
| back | 0 | 0 | 4 | 0 |
| babies | 0 | 0 | 0 | 2 |
| wah | 0 | 0 | 0 | 12 |
| bus | 2 | 2 | 2 | 2 |
| the | 5 | 5 | 5 | 5 |
| driver | 0 | 0 | 2 | 0 |
| shut | 0 | 4 | 0 | 0 |
| all | 1 | 1 | 1 | 1 |
| wheels | 2 | 0 | 0 | 0 |
| go | 2 | 2 | 0 | 2 |
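The same table can be built with `collections.Counter`, which counts each document in a single pass instead of calling `count` once per vocabulary word. The sketch below uses two shortened, illustrative documents and a plain dictionary in place of the DataFrame; sorting the vocabulary is an added choice to make the row order reproducible:

```python
import string
from collections import Counter

docs = [
    "The wheels on the bus go, round and round.",
    "The doors on the bus go, open and shut.",
]

# Lowercase, strip punctuation, split into words, and count per document.
strip_punct = str.maketrans('', '', string.punctuation)
counts = [Counter(doc.lower().translate(strip_punct).split()) for doc in docs]

# Sorted vocabulary across all documents.
vocab = sorted(set().union(*counts))

# One row per word, one column per document; a missing word counts as 0.
table = {word: [c[word] for c in counts] for word in vocab}

for word, row in table.items():
    print(f'{word:<8}', row)
```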