I would like to extract movie script dialogues like so:
- Character names in UPPERCASE
- The dialogue that follows, up to the next line break, so that the narration is not captured as well.
Current Regex: ((\s[^\w].\s[A-Z]+)\n+.+)
Problem is, it only extracts the character name and the first sentence from the dialog.
Here’s the testing data:
ADAM Help! Someone help me! (He stops when he hears a loud dragging sound somewhere in the room. He looks out into the darkness and calls out.) Is someone there? Hey! (He turns back to the corner to which he is chained, says in a slightly softer but still panicked voice) Shit, Im probably dead. Suddenly, from out within the darkness comes a mans low, raspy voice. It startles Adam. The voice, we will soon learn, belongs to LAWRENCE. LAWRENCE Youre not dead. Adam quickly turns in the direction of the voice. Holding his arms out for balance, he tries to look across the room to whoever is speaking, but still cannot see a thing. ADAM Whos that? Whos that?! LAWRENCE (his voice strangely a bit on the calm side, and almost irritated with Adams reaction to the situation. This shall be his tone for many scenes to come.) Theres no point in yelling, I already tried it. ADAM Turn on the lights! LAWRENCE Would if I could. ADAM What the fuck is going on? Where am I? (He turns into his corner, touching the wall.) LAWRENCE I dont know yet. ADAM (smelling something; in disgust) What is that smell? LAWRENCE Shh! Hang on a second, I think I found something. With a loud click and an even louder buzzing sound, the very bright fluorescent lights come to life, lighting up in rows, starting from Lawrences end and moving towards Adam. As they come on, Adam is nearly blinded by the sudden change from pitch black to bright white and squints in pain, holding up his arms to cover his face. In the light we now see that he is in his mid-twenties, with short brown hair, wearing a dark blue striped shirt over a white tee shirt and jeans, looking like a drowned rat from the tub. It takes him a moment but his eyes finally start to adjust, and he looks around the room. He and we see Lawrence, who also winces from the glare of the lights, standing by the light switch and the door. He is on the opposite end of the room, also chained to a pipe in the corner by his foot. 
He wears a blue button-down dress shirt, now soaked with sweat stains. He is middle aged, mid to late forties, with pale blonde hair and even paler skin. Dark circles are under his eyes. Both men are barefoot. Lawrences eyes adjust to the light and he sees across the room. Then, his gaze starts towards the center of the room, as does Adams, who steps forward as much as he can, a look of horror on his face. We see lying face down the body of a man who has blown his brains out, lying in a pool of blood, clad in only boxer shorts and a tee shirt. In his left hand is a gun, in his right hand is a micro cassette recorder. A gunshot and a scream are heard as the camera moves up and in a fast 360° angle above and circling the man, ending in a full overhead view of him. The shot cuts to Adam, who reels in shock and disgust. ADAM Holy shit! He turns towards the tub and leans over, gagging and coughing. Lawrence in the meantime hops forward the best he can, studying the body with a look of fear and concern. Adam stops coughing and turns back around, takes another look at the body and around the room. He looks down at his chain then starts to completely freak out, grabbing and pulling at his chain. ADAM (screaming) HELP!!! (He falls back onto his bottom on the floor as he yanks at the chain as hard as he can.) HELP!!! Help! Lawrence just stands and watches him with an almost embarrassed, appalled look at his behaviour. It seems that Lawrence, despite being in the situation hes in, is above that kind of uncontrolled reaction. He speaks a bit coldly. LAWRENCE SANDLER No one can hear you.
EDIT
New Regex: (\w[A-Z]+\n\s).+?(?=\n)
Answer
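You can use the following regex:
(?:([A-Z]+ *[A-Z]+)\n).*?(?=$|([A-Z]+ *[A-Z]+)\n)
Explanation:
(?:([A-Z]+ *[A-Z]+)\n)
- Non-capturing group that matches a character name: uppercase letters only, with zero or more spaces allowed in between, followed by a line break. The name itself is captured by the inner group.
.*?
- Lazily matches as few characters as possible. Note that . does not match a line break by default in most regex flavours, so the "dot matches newline" (single-line) flag is needed if the dialogue spans several lines.
(?=$|([A-Z]+ *[A-Z]+)\n)
- Positive lookahead that asserts either the end of the input or the start of the next dialogue (a character name followed by a line break, as above).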
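As a rough illustration, here is how the pattern might be applied in Python. This is only a sketch under a few assumptions: the script has each character name on its own line (the line breaks are not visible in the test data as posted), the re.DOTALL flag is used so that . also matches line breaks, and a second capture group is added around the dialogue so it can be read back separately. The names DIALOGUE_RE, extract_dialogues and sample are made up for this example.

```python
import re

# Character name on its own line, then (lazily) everything up to the next
# character name or the end of the input. re.DOTALL lets "." cross line breaks.
DIALOGUE_RE = re.compile(
    r"([A-Z]+ *[A-Z]+)\n(.*?)(?=$|[A-Z]+ *[A-Z]+\n)",
    re.DOTALL,
)


def extract_dialogues(text):
    """Return a list of (character_name, dialogue) tuples found in text."""
    return [(m.group(1), m.group(2).strip()) for m in DIALOGUE_RE.finditer(text)]


# Assumed layout: name line, then the spoken line(s).
sample = (
    "ADAM\n"
    "Help! Someone help me!\n"
    "LAWRENCE\n"
    "Youre not dead.\n"
    "LAWRENCE SANDLER\n"
    "No one can hear you.\n"
)

for name, speech in extract_dialogues(sample):
    print(f"{name}: {speech}")
```

Because the match runs up to the next character name, any narration sitting between two speeches ends up in the captured dialogue too. If only the first line after the name is wanted, one option is to replace (.*?) and the lookahead with ([^\n]*).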