Extracting Dialogs from movie scripts using Regex

I would like to extract dialogue from a movie script, capturing:

  • The character names, which are written in UPPERCASE
  • The dialog that follows each name, up to the blank line that ends it, so that the narration is not picked up as well.

Current Regex: ((\s[^\w].\s[A-Z]+)\n+.+)

Problem is, it only extracts the character name and the first sentence from the dialog.

Here’s the testing data:

                                  ADAM
                         Help! Someone help me! (He stops when 
                         he hears a loud dragging sound somewhere 
                         in the room. He looks out into the darkness 
                         and calls out.) Is someone there? Hey! 
                         (He turns back to the corner to which 
                         he is chained, says in a slightly softer 
                         but still panicked voice) Shit, Im 
                         probably dead.

               Suddenly, from out within the darkness comes a mans low, raspy 
               voice. It startles Adam. The voice, we will soon learn, belongs 
               to LAWRENCE.

                                     LAWRENCE
                         Youre not dead.

               Adam quickly turns in the direction of the voice. Holding his 
               arms out for balance, he tries to look across the room to whoever 
               is speaking, but still cannot see a thing.

                                     ADAM
                         Whos that? Whos that?!

                                     LAWRENCE
                         (his voice strangely a bit on the calm 
                         side, and almost irritated with Adams 
                         reaction to the situation. This shall 
                         be his tone for many scenes to come.) 
                         Theres no point in yelling, I already 
                         tried it.

                                     ADAM
                         Turn on the lights!

                                     LAWRENCE
                         Would if I could.

                                     ADAM
                         What the fuck is going on? Where am 
                         I? (He turns into his corner, touching 
                         the wall.)

                                     LAWRENCE
                         I dont know yet.

                                     ADAM
                         (smelling something; in disgust) What 
                         is that smell?

                                     LAWRENCE
                         Shh! Hang on a second, I think I found 
                         something.

               With a loud click and an even louder buzzing sound, the very 
               bright fluorescent lights come to life, lighting up in rows, 
               starting from Lawrences end and moving towards Adam. As they 
               come on, Adam is nearly blinded by the sudden change from pitch 
               black to bright white and squints in pain, holding up his arms 
               to cover his face. In the light we now see that he is in his 
               mid-twenties, with short brown hair, wearing a dark blue striped 
               shirt over a white tee shirt and jeans, looking like a drowned 
               rat from the tub. It takes him a moment but his eyes finally 
               start to adjust, and he looks around the room. He and we see 
               Lawrence, who also winces from the glare of the lights, standing 
               by the light switch and the door. He is on the opposite end of 
               the room, also chained to a pipe in the corner by his foot. He 
               wears a blue button-down dress shirt, now soaked with sweat stains. 
               He is middle aged, mid to late forties, with pale blonde hair 
               and even paler skin. Dark circles are under his eyes. Both men 
               are barefoot.

               Lawrences eyes adjust to the light and he sees across the room. 
               Then, his gaze starts towards the center of the room, as does 
               Adams, who steps forward as much as he can, a look of horror 
               on his face. We see lying face down the body of a man who has 
               blown his brains out, lying in a pool of blood, clad in only 
               boxer shorts and a tee shirt. In his left hand is a gun, in his 
               right hand is a micro cassette recorder. A gunshot and a scream 
               are heard as the camera moves up and in a fast 360° angle above 
               and circling the man, ending in a full overhead view of him.


               The shot cuts to Adam, who reels in shock and disgust.

                                     ADAM
                         Holy shit!

               He turns towards the tub and leans over, gagging and coughing. 
               Lawrence in the meantime hops forward the best he can, studying 
               the body with a look of fear and concern. Adam stops coughing 
               and turns back around, takes another look at the body and around 
               the room. He looks down at his chain then starts to completely 
               freak out, grabbing and pulling at his chain.

                                     ADAM
                         (screaming) HELP!!! (He falls back onto 
                         his bottom on the floor as he yanks 
                         at the chain as hard as he can.) HELP!!! 
                         Help!

               Lawrence just stands and watches him with an almost embarrassed, 
               appalled look at his behaviour. It seems that Lawrence, despite 
               being in the situation hes in, is above that kind of uncontrolled 
               reaction. He speaks a bit coldly.

                                     LAWRENCE SANDLER
                         No one can hear you.

EDIT

New Regex: (\w[A-Z]+\n\s).+?(?=\n)

Answer

You can use the following regex:

(?:([A-Z]+ *[A-Z]+)\n).*?(?=$|([A-Z]+ *[A-Z]+)\n)

Explanation:

(?:([A-Z]+ *[A-Z]+)\n)

  • A non-capturing group that matches only capital-letter words, with zero or more spaces allowed in between, followed by a line ending. The inner group captures the name.

.*?

  • Matches as few characters as possible (a lazy match).

(?=$|([A-Z]+ *[A-Z]+)\n)

  • A positive lookahead that asserts either the end of the input or the start of a new dialogue, that is, another name line as described above.
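
For anyone trying this outside a regex tester, here is a minimal Python sketch of the pattern above with the escapes written out. The `re.DOTALL` flag is an assumption on my part: without it `.` does not cross line breaks, so the lazy part would stop on the first line. The `speeches` helper and the shortened sample text are made up for illustration.

```python
import re

# The pattern from the answer. re.DOTALL is assumed so that "." can
# span several lines of dialog.
PATTERN = re.compile(
    r"(?:([A-Z]+ *[A-Z]+)\n).*?(?=$|([A-Z]+ *[A-Z]+)\n)",
    re.DOTALL,
)

# Shortened, hypothetical sample in the same layout as the test data.
sample = (
    "            ADAM\n"
    "     Turn on the lights!\n"
    "\n"
    "            LAWRENCE\n"
    "     Would if I could.\n"
)

def speeches(text):
    """Return (name, speech) pairs with whitespace collapsed."""
    pairs = []
    for m in PATTERN.finditer(text):
        name = m.group(1)
        body = m.group(0)[len(name):]  # everything matched after the name
        pairs.append((name, " ".join(body.split())))
    return pairs

print(speeches(sample))
```

Note that the lazy match runs up to the next name line, so any narration sitting between two speeches ends up in the first one.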

Demo
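
If the narration should be left out, one possible variant (my own sketch, not the pattern above) is to stop each speech at the first blank line. It assumes a name sits alone on its own indented line and that a blank line always ends the speech; `DIALOG_RE` and `extract_dialog` are hypothetical names.

```python
import re

DIALOG_RE = re.compile(
    r"^[ \t]*([A-Z][A-Z ]*[A-Z])[ \t]*\n"  # all-caps name alone on a line
    r"((?:[ \t]*\S.*(?:\n|\Z))+)",         # non-blank lines that follow it
    re.MULTILINE,
)

def extract_dialog(text):
    """Return (name, speech) pairs; each speech ends at the first blank line."""
    return [
        (m.group(1), " ".join(m.group(2).split()))
        for m in DIALOG_RE.finditer(text)
    ]

# Shortened, hypothetical sample in the same layout as the test data.
sample = (
    "                                  ADAM\n"
    "                         Help! Someone help me! \n"
    "                         Is someone there?\n"
    "\n"
    "               Suddenly, a voice. It belongs \n"
    "               to LAWRENCE.\n"
    "\n"
    "                                     LAWRENCE SANDLER\n"
    "                         No one can hear you.\n"
)

for name, speech in extract_dialog(sample):
    print(f"{name}: {speech}")
```

Here the narration paragraph is skipped because none of its lines consists only of capital letters, and "LAWRENCE." inside the narration is ignored because it does not stand alone on a line.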

User contributions licensed under: CC BY-SA