Knowledge Graphs

Exercise 5.2 - Word Shape

An important feature used for entity extraction is word shape: it represents the abstract letter pattern of a word by mapping lower-case letters to x, upper-case letters to X, and digits to d, while retaining punctuation. For example, C.I.A. maps to X.X.X. and IRS-1040 maps to XXX-dddd. In the shorter version of word shape, each run of consecutive characters of the same type is collapsed into a single character. For example, C.I.A. still maps to X.X.X., but IRS-1040 maps to X-d. With these definitions, address the following questions.
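
The two definitions can be checked mechanically. The following is a minimal Python sketch, not a reference implementation: the function names word_shape and short_word_shape are hypothetical, and it assumes that only runs of mapped characters (x, X, d) are collapsed in the short version, while repeated punctuation is left untouched. The text does not say how repeated punctuation should be handled, so that choice is an assumption.

```python
import re


def word_shape(word: str) -> str:
    """Map lower-case letters to x, upper-case to X, digits to d; keep the rest."""
    out = []
    for ch in word:
        if ch.islower():
            out.append("x")
        elif ch.isupper():
            out.append("X")
        elif ch.isdigit():
            out.append("d")
        else:
            # Punctuation and any other character is retained as-is.
            out.append(ch)
    return "".join(out)


def short_word_shape(word: str) -> str:
    """Collapse each run of the same mapped character type into one character.

    Assumption: only runs of x, X, or d are collapsed; punctuation runs are kept.
    """
    return re.sub(r"([xXd])\1+", r"\1", word_shape(word))


if __name__ == "__main__":
    # The two examples given in the text above.
    for w in ["C.I.A.", "IRS-1040"]:
        print(w, word_shape(w), short_word_shape(w))
```

Running the script on the two examples from the text reproduces the shapes stated there, which is a quick way to test answers to the questions that follow.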

a. What is the shape of the word Googenheim?
b. What is the short shape of the word Googenheim?
c. What is the regular expression for the shape of the word Googenheim?
d. What is the regular expression for the short shape of the word Googenheim?
e. Is it true that the short shape of a word is always strictly shorter than its regular shape?