War of the Worlds Regular Expressions

Upconverting Speakers

Document Analysis Identified Pattern: speakers appear at the start of a line; speaker names are in all capital letters; some speaker names contain spaces

Find: ^[A-Z ]+:
^ indicates the start of a line
[A-Z ]+: the square brackets indicate a character set, here the range of upper-case letters A-Z plus the space character; the plus sign means one or more characters from that set; the final colon is matched literally

To remove the colon (pseudo-markup) when we replace the plain text with XML markup, we need to enclose the portion of the pattern we want to keep (the speaker names) in a capturing group.
Find: ^([A-Z ]+):

Replace: <speaker>\1</speaker>
\1 refers to the text matched by the capturing group in the find expression, ([A-Z ]+); the colon sits outside the group, so it is dropped
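
The speaker find and replace can be tried outside the editor. Below is a minimal sketch in Python's re module, assuming two made-up sample lines (they are not taken from the transcript). One difference to keep in mind: Python only treats ^ as "start of a line" when the re.MULTILINE flag is set.

```python
import re

# Hypothetical sample lines in the format the notes describe.
text = (
    "ANNOUNCER: Ladies and gentlemen.\n"
    "ANNOUNCER TWO: We take you now to Grovers Mill.\n"
)

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
speaker_pattern = re.compile(r"^([A-Z ]+):", re.MULTILINE)

# \1 inserts the captured speaker name; the colon is outside the group and is dropped.
tagged = speaker_pattern.sub(r"<speaker>\1</speaker>", text)
print(tagged)
```

Because the space is part of the character set, a two-word name such as ANNOUNCER TWO is captured whole.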

Upconverting Stage Directions

Document Analysis Identified Pattern: Stage directions appear in parentheses; stage directions are in upper-case

Find: \([^a-z]+\)
\( and \) because parentheses normally indicate capturing groups in regular expressions, we need to escape them with a backslash in order to match literal parentheses in the text
[^a-z]+ the caret (^) at the start of the character set indicates not; therefore, this expression looks for one or more (indicated by the plus sign) characters that are NOT lower-case letters, which includes upper-case letters, spaces, digits, and punctuation

To remove the parentheses (pseudo-markup) when we replace the plain text with XML markup, we need to enclose the portion of the pattern we want to keep (just the text of the stage directions) in a capturing group.
Find: \(([^a-z]+)\)

Replace: <stage>\1</stage>
\1 refers to the text matched by the capturing group in the find expression, ([^a-z]+); the escaped parentheses sit outside the group, so they are dropped
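
The stage-direction pattern can be checked the same way. This is a minimal sketch in Python's re module, assuming an invented sample line that contains one upper-case stage direction and one lower-case parenthetical that should be left alone.

```python
import re

# Hypothetical sample line, not taken from the transcript.
text = "PHILLIPS: I can see the object. (CROWD NOISES) It is (he pauses) enormous."

# \( and \) match literal parentheses; the inner group keeps only the text between them.
stage_pattern = re.compile(r"\(([^a-z]+)\)")

tagged = stage_pattern.sub(r"<stage>\1</stage>", text)
print(tagged)

# Caveat: ( and ) are themselves "not lower-case letters", so two stage
# directions separated only by a space are matched as a single one.
merged = stage_pattern.sub(r"<stage>\1</stage>", "(APPLAUSE) (MUSIC)")
print(merged)
```

The second call shows a limit of the pattern: when nothing lower-case separates two stage directions, the greedy plus sign runs through both. If the transcript contains such cases, a narrower set such as [^a-z()] may be worth testing.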

Upconverting Paragraphs in Speeches

-Attempt-
Find: ^[^<]+\n\n
^ indicates start of a line
\n indicates a line return; \n\n matches two in a row - the end of a line followed by a blank line
[^<]+ the caret (^) at the start of the character set indicates not; therefore, this expression looks for one or more (indicated by the plus sign) of any character that is NOT a left angle bracket (<)

Find: ^([^<]+?)$
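^ indicates start of a line and $ indicates end of line
[^<]+? the question mark after the plus sign makes the match lazy: it stops at the first end of line it reaches instead of continuing as far as possible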
Replace: <p>\1</p>
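
The paragraph find and replace can be sketched in Python's re module as well. The sample below is an assumption about the layout: the speaker tags are already in place, and each paragraph sits on its own line with no blank lines between them. re.MULTILINE is needed so that ^ and $ work line by line.

```python
import re

# Hypothetical sample, assuming speakers are already tagged and
# each paragraph occupies exactly one line.
text = (
    "<speaker>PHILLIPS</speaker>\n"
    "Good heavens, something is wriggling out.\n"
    "It glistens like wet leather.\n"
    "<speaker>ANNOUNCER</speaker>\n"
    "We are bringing you an eyewitness account.\n"
)

# [^<]+? skips any line that contains a tag, and the lazy quantifier
# stops at the first end of line, so each plain line is wrapped separately.
para_pattern = re.compile(r"^([^<]+?)$", re.MULTILINE)

tagged = para_pattern.sub(r"<p>\1</p>", text)
print(tagged)
```

Lines that already contain markup are left untouched because [^<] cannot match the left angle bracket.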

Upconverting Speeches

Using the close-open technique (“clopen”)
Find: <speaker>
Replace: </sp><sp>\0
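\0 refers to the entire match, here <speaker>, so each speaker tag is kept and gains </sp><sp> in front of it
After the replacement, the first speech begins with a stray </sp>; take that closing tag from the beginning of the first speech and place it at the end of the last speech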
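
The close-open step, including the manual move of the stray closing tag, might look like this in Python. It is a minimal sketch on an invented two-speech sample. Note that Python's re module does not accept \0 for the whole match; the equivalent there is \g<0>.

```python
import re

# Hypothetical sample with speakers and paragraphs already tagged.
text = (
    "<speaker>PHILLIPS</speaker>\n"
    "<p>It glistens like wet leather.</p>\n"
    "<speaker>ANNOUNCER</speaker>\n"
    "<p>We are bringing you an eyewitness account.</p>\n"
)

# Close the previous speech and open a new one before every speaker tag.
# \g<0> re-inserts the whole match, i.e. the <speaker> tag itself.
clopened = re.sub(r"<speaker>", r"</sp><sp>\g<0>", text)

# The very first </sp> closes nothing: remove it and append it
# after the last speech instead.
wrapped = clopened.replace("</sp>", "", 1).rstrip("\n") + "</sp>\n"
print(wrapped)
```

Doing the move in code mirrors the manual cut and paste described above; it assumes nothing follows the last speech in the file.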