TV Pilots are a treasure trove of data

A few months ago, I did a little weekend project looking at TV comedy pilot scripts. For those unfamiliar with the concept: when a television show is being developed, a network will order a pilot episode as a test to decide whether to pick the show up for a full season. Along the way, the idea may be reworked and elements changed to “make it work” for that network.

Part 1: Fetching and Normalizing the data

To start, I scraped about 450 television pilots. Here was my first challenge: some of these were just text files (awesome), but others were PDFs. To extract the text from the PDFs, I turned to pdftotext and, where that was not enough, to Tesseract. Below is the script I used to extract the text:

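# requires pdfinfo and pdftotext (poppler), convert (ImageMagick) and tesseract
mkdir -p textfiles parsed images
find . -name '*.pdf' -print0 | while IFS= read -r -d '' f; do
  parsedfilename=$(basename "$f" .pdf)
  PAGES=$(pdfinfo "$f" | grep Pages: | awk '{print $2}' | tail -n 1)
  # some text was parsable just using pdftotext
  pdftotext -layout "$f" - > "textfiles/$parsedfilename.txt"
  # assumption: OCR is only needed when pdftotext extracted no text at all
  if [ ! -s "textfiles/$parsedfilename.txt" ]; then
    echo "File $parsedfilename has no extractable text, running OCR"
    for i in $(seq 1 "$PAGES"); do
      # converts one page to an image (ImageMagick page indexes start at 0)
      convert -density 500 -depth 8 "$f[$((i - 1))]" "images/page$i.png"
      # tesseract parses the image for text and appends it to a file
      tesseract "images/page$i.png" stdout >> "parsed/$parsedfilename.txt"
    done
  fi
done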

This got most of the scripts in a format that could be queried. Here’s a sample of the very funny 30 Rock pilot – note the different character names –

The studio's homebase set. Workman are polishing a big
sign that reads, "Friday Night Bits with Jenna DeCarlo.
"Pull back through the picture window to where KENNETH a
bright and chirpy (Clay Aiken type) NBC page is giving a
tour. He stands next to-a life-size standee of impish
comedian Jenna DeCarlo. '

Part 2: Apache Spark analysis

Now that the data was machine readable, the best first step was to query the text files for data that I thought might be interesting. Apache Spark is a great tool for loading up datasets like this, so I went into the Spark shell and ran some experiments. Here is some of the code I used to get to these numbers:

// loads both folders of the 450 comedy scripts into one RDD of lines
val parsedFiles = sc.textFile("./tvscript/parsed,./tvscript/textfiles")
// outputs the number of lines that contain the phrase "20s"
parsedFiles.filter(line => line.contains("20s")).count()

Exterior vs interior scenes

Screenplays are unique because of the way they are formatted: they announce whether a scene is interior or exterior at the beginning of the scene with either INT. or EXT., so I started there.
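For anyone who wants to try the same count without Spark, here is a minimal sketch from the shell. It assumes the extracted scripts sit as .txt files under one directory, and the function name count_headings is made up for illustration; it is not the code behind the chart.

```shell
#!/usr/bin/env bash
# Count scene headings that start with a given prefix ("INT" or "EXT")
# across every .txt file under a directory.
# Leading whitespace is allowed because pdftotext -layout and OCR output
# is often indented.
count_headings() {
  local prefix="$1" dir="$2"
  # -r recurse, -h hide file names, -E extended regex; wc -l tallies the matches
  grep -rhE --include='*.txt' "^[[:space:]]*${prefix}\." "$dir" | wc -l | tr -d ' '
}
```

Calling `count_headings INT ./tvscript/textfiles` and `count_headings EXT ./tvscript/textfiles` would give the two numbers to compare. Anchoring the match to the start of the line keeps words like "INTERIOR" in dialogue from being counted.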


[Screenshot: chart of interior vs. exterior scenes]

My take: It is significantly cheaper to shoot indoors than outdoors, so this might be self-selection by writers trying to make sure their show gets picked up.


Age of characters

When announcing a character in a screenplay, you usually give a short description that includes their age, usually by decade. For example, from The Grinder script: “STEWART SANDERSON (30’s) drives with his family”.
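A rough sketch of how the decades could be tallied from the shell follows. It assumes ages appear in parentheses, as in the example above, with either a straight or curly apostrophe or none at all; descriptions like "mid-30s" or "30" would be missed. The function name age_decades is hypothetical.

```shell
#!/usr/bin/env bash
# Tally character ages written by decade, e.g. "(30s)", "(30's)" or "(30’s)",
# across every .txt file under a directory.
age_decades() {
  local dir="$1"
  # first grep pulls out the parenthesized age, second keeps only the digits
  grep -rhoE --include='*.txt' "\(([0-9]0)('|’)?s\)" "$dir" \
    | grep -oE '[0-9]0' \
    | sort | uniq -c \
    | awk '{print $2 "s " $1}'   # print "<decade>s <count>"
}
```

Running `age_decades ./tvscript/textfiles` prints one line per decade with its count, which is enough to chart the distribution.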

[Screenshot: chart of character ages by decade]


My take: No surprises here. Television is geared towards viewers aged 24-54, and networks want to show a good distribution of those people on TV.

Part 3: Sentiment Analysis

That was a fun experiment, but it was time to go further. In looking at the data, I realized I could do a sentiment analysis of each block of text in an episode and see if any patterns appeared. I created a new Scala project built on Stanford’s natural language processing library and based on work done here. Each block of text was analyzed and then put into a MongoDB store with a structure that looks like this:

{
  "sentiment" : 1,
  "textFile" : "Black-ish 1x01 - Pilot",
  "line" : " DIANE\n She’s weird, so feel free to say no.",
  "weight" : 263
}

Here “sentiment” is a scale from 1-5 with 1 being most negative, “line” is the actual block of text, and “weight” is the order in which it occurs in the episode, so this was the 263rd thing said in the episode. With the data in place, I built a small Node server that could display a chart for the scripts I parsed. Here are some screenshots of the results:


[Screenshots: sentiment charts for two scripts]

Pretty neat, right? There is also a fully interactive version where you can look at the 100+ scripts I did sentiment analysis for.

Some More Books I Read in 2016

When I started the year, I had a goal of writing a blurb about each book I finished; however, I’ve fallen way behind on the writing piece. As a form of catch-up (read: cheating), I’m going to quickly recap the books I’ve read since The Power Of Habit.

  • Born To Run – A great read while I trained for my first half marathon. I always enjoy reading about forgotten knowledge, and when it comes to running long distances we’ve certainly forgotten more than we know today.
  • Domain Driven Design Distilled – If you are building software applications, domain driven design is an approach to wrangling complex business logic by structuring your applications the way your users think and speak about them. This book and others will help you work through the brainstorming process for that, as well as some design patterns like CQRS and event sourcing to aid in communicating changes in your domain driven services.
  • Food: A Love Story – Jim Gaffigan writing about food. It is great.
  • Infrastructure as Code – You’ll be inspired to rewrite your entire infrastructure so it’s not a set of hand-rolled, fragile pets but an automated, repeatable, scalable infrastructure made of servers that are ready to be killed at any moment.
  • Disrupted: My Misadventure in the Start-Up Bubble – A man in his 50s joins HubSpot and hilarity ensues. Of course, about halfway through the book things take a dark turn. A great read for anyone who has worked at a startup and sometimes thinks everyone there is going nuts.
  • Packing for Mars – An adventure through all of the weird stuff about going to space (how do they determine who has “the right stuff”?), floating in space (has anyone ever had sex in there?) and how we’ll ever colonize space (it’s probably like locking people in a space capsule on Earth for months at a time… let’s see how that goes).
  • American Icon: Alan Mulally and the Fight to Save Ford – Alan Mulally took a company that by all rights should have been dead and rebuilt it by getting people to work with each other, focusing the product line, and solving problems of quality above all else.
  • The Psychopath Test – There is a test that professionals use to determine if someone is a psychopath. Jon Ronson explores whether he can learn to identify psychopaths by learning the test and finds out a lot about crazy people along the way.
  • The Etymologicon – My favorite feature of Google is asking for <WORD> etymology. This was basically 7 hours of that, but as a constant flow of words. It was so much fun.
  • Grunt – Military science rarely gets covered in the news, and this was a fun way to learn about it. I saw Mary Roach speak about this book when it came out; she said she focused on the human stories and innovations that are largely ignored.
  • Omnivore’s Dilemma – It’s tough being human. We are making complicated trade-offs (organic, local, slow, fast, …whatever the trend); none of them are silver bullets, and this book covers a lot of these trade-offs.
  • Seinfeldia – I’m a big Seinfeld fan, and hearing all about how the ideas and stories came to light is fascinating. Can you imagine where we would all be without the big salad, yadda yadda yadda and the contest? Probably the darkest timeline.
  • Shoe Dog – I didn’t know the story of Nike. Hearing about the giant risks they took to create one of the largest shoe companies and one of the most recognizable brands of all time is a ton of fun. Reading this book after Born To Run was great as well, because you can see how early Nike design decisions influenced the shoe industry in ways that may have led to the backlash that became the barefoot running movement.
  • Spook – I’m clearly a fan of Roach’s writing – while this wasn’t my favorite of her works it’s still a good time to explore what happens after we die. Roach doesn’t pull punches as she explores reincarnation, people who communicate with the dead and even stories about past popes.
  • American Nations: A History of the Eleven Rival Regional Cultures of North America – This was probably the one book I read this summer that I couldn’t stop talking about. “And see, the nations have never gotten along, it was just these 3 events where we pretended to” or “Can you believe that people from the Appalachian states have made up most of our military since the Revolution but only account for a small piece of our population?” is how I annoyed Sara every time I put down the book. When I finished the book I was amazed, because the first article I read afterwards was about prison sentences differing widely from county to county. You can read the full article here, but the map of where the harshest prison sentences are handed out isn’t a map of red states and blue states; it is almost an exact map of the “Borderland” states.
  • The Phoenix Project – A fictional book about DevOps, what will they think of next? But you know what, despite the deus ex machina up the wazoo (oh, how convenient that the factory down the street has the exact same problem as our infrastructure team!), it was fun to see XP principles and DevOps applied and how they can transform an organization.
  • Notorious RBG – Ruth Bader Ginsburg’s journey to the Supreme Court and her impact on the court is inspiring. While the author tends to float into meme-heavy pieces about how cool RBG is, it was still a great story about a woman who has been steadfast in her fight to make our nation into a more perfect union.
  • Ego Is the Enemy – Ryan Holiday goes deep throughout history and does a fantastic job finding examples where ego ruined a successful person. As much as we love Kanye, in looking at the data he is the exception, not the rule.
  • Harry Potter and the Cursed Child – We’re going to see the play in London next fall, and after reading through the script I can’t wait to see it on stage.


The Power of Habit: Why We Do What We Do in Life and Business By Charles Duhigg

At my old job, we used to have cans of Coca-Cola available in our fridge. I remember every day I would start to drag around 3 PM, and then the lovely sound of someone else opening a can would cause me to get up, drift to the fridge, open my own and drink it as fast as possible. By 3:30 I was ready to tackle that important piece of work I had to do.

This was a habit. It was set off by a cue, followed by a routine and punctuated with a reward. According to The Power of Habit, these are the three elements of the habit loop. The book gives you the lowdown on how habits are built, ignored and exploited, and how to break them. The author weaves together stories of Alcoholics Anonymous, the Tampa Bay Buccaneers, early toothpaste salesmen and the launch of Febreze to create a compelling tale. He reminds us that habits can work for both good and bad, and that they can be used to build better businesses and lives. A great read for those looking to create a “sticky” brand.


This Is Your Brain on Sports: The Science of Underdogs, the Value of Rivalry, and What We Can Learn from the T-Shirt Cannon by L. Jon Wertheim and Sam Sommers

I like sports, but I think I love the human side of sports more. I read Deadspin, when I had cable I watched Outside The Lines and 30 for 30, and I listen weekly to Only A Game on NPR. Sometimes it feels like professional sports is its own universe where outlandish behavior is acceptable and normal human beings go from boring to insane. This Is Your Brain On Sports uses studies and journal articles to show that the outlandish behavior we see in sports is real, and that it is also very much happening outside of sports.

One of my favorite pieces was about how the public views quarterbacks as universally good looking. The authors dug into this and ran their own study. They found that their subjects, who were not football fans, rated quarterbacks as less attractive on average than running backs and defensive linemen when shown many players’ faces. I actually experienced this first hand on our flight to Paris, where Sara and I sat next to a Patriots linebacker who Sara would later tell me was “much better looking than either Manning brother.”

What was most interesting about the study was that subjects judged quarterbacks to have more leadership qualities purely based on photos. It aligned with a similar study done with the faces of employees and CEOs, in which CEOs were presumed to have more “leadership” qualities. These studies got me thinking about getting asked for directions. Discussing it with my brother and father, we’ve found we get asked for directions far more than other people we know. On my current trip to Paris and Brussels, I’ve been asked for directions in 3 different languages in two different countries, with many other people standing close enough who could have been up to the task. So why do they ask me? Maybe I have a face that says I’m approachable? Good with directions? Willing to help? Who knows! But if you like reading about sports and how it reflects our daily lives, check out this book.


Seven Languages in Seven Weeks By Bruce Tate

Full disclosure: I didn’t fully embrace this book. I didn’t do the exercises at the end of each chapter, and for most of the languages, I don’t even have them installed on my computer. OK, it feels good to admit that. With that disclosed, I will say Seven Languages In Seven Weeks was a treat. Of the 7 languages in this book (Ruby, Io, Prolog, Scala, Clojure, Erlang, and Haskell), I had only used 2 in the past, Ruby and Scala. My background is mostly in object-oriented, procedural and prototype-based languages; this book, however, shifts its focus towards languages that are more functional and are built with pattern matching and concurrency in mind, concepts that are not a focus of the languages I’ve worked with in the past, other than Scala.

Seeing the evolution of languages was insightful to me: how closely tied to Lisp Clojure is, or how Scala and Erlang’s pattern matching is inspired by Prolog. While first investigating Scala, I could see the implementation of pattern matching, but it wasn’t clear how powerful it could be until I saw how Erlang and Prolog leverage it in this book. While I’ve appreciated limiting state in the systems I build, the book made it much clearer how functional programming can be leveraged to supercharge concurrency. In languages like Java and Go, concurrency can be a proceed-with-caution situation because the developers writing the code are worried about race conditions and side effects. When we can significantly limit race conditions and mutable state, concurrency is less scary.

The interviews with the authors of each language spoke volumes about the trade-offs and intentions behind each language. Ruby is the language where the trade-offs are most obvious: a great syntax that leads to productivity, traded for speed. Having revisited Ruby recently for other projects, both personal and for work, I can see the productivity increase right away, but it might take some time to feel the trade-off in speed. The challenge, of course, is knowing when to migrate.

What I discovered at the end of reading this book was the language I wanted to explore next. Oddly enough, it wasn’t in this book and, even more strangely, after doing some more reading I learned it was partially inspired BY this book. That language is Elixir. Elixir combines some of the syntactic sugar we love about Ruby, the metaprogramming of Clojure and the concurrency and power of Erlang.


The Men Who Stare At Goats By Jon Ronson

The Men Who Stare At Goats follows the ups and downs of the US military’s use of paranormal and New Age concepts from the 1970s to today, mostly focusing on their resurgence after 9/11. The title comes from Ronson’s search for a man who allegedly learned to harness his psychic ability to kill a goat, and other mammals, simply by staring at them. The cast of characters is deep, and imagining Ronson tracking them down is part of the fun. Who is the man who actually killed the goat by staring, and did it even happen? Who was it that learned Matchbox 20 was the best way to send out subliminal messages? Who took elements of the proposed New Age peaceful measures and distorted them for use in events that led to Abu Ghraib, Waco and Guantanamo?

The intentions early on seemed so earnest. What if we could take New Age ideas about peace and harness them non-violently in our military? Our military could carry ginseng and could stop their enemies not by firing weapons but by surprising them with hugs and Jedi mind tricks. The violent and disturbing outcomes appear as misunderstandings and distortions of the original intent of the First Earth Battalion, the proposed group of supersoldiers who encouraged nonviolent conflict resolution.

What is clear from reading this book is that parts of the military and intelligence want alternatives. While some of these alternatives are crazy schemes like invisibility and walking through walls, it certainly seems that the military is not about to slow down on them anytime soon.


The Aviators: Eddie Rickenbacker, Jimmy Doolittle, Charles Lindbergh, and the Epic Age of Flight by Winston Groom

Two striking ideas came to me while reading The Aviators: the pace of innovation in the period between the World Wars was staggering, and one man can easily fall in and out of grace. The Aviators follows the lives of Eddie Rickenbacker, the ace of aces in World War I; Charles Lindbergh, the first person to fly solo nonstop across the Atlantic; and Jimmy Doolittle, the first person to fly blind, i.e. only by instruments. The amount of danger involved in what these men did every day was terrifying. Forget about the firefights in WWI and WWII; even just flying from one city to another was risky early on. Not knowing the distance between the plane and the ground, or getting wrapped up in violent weather at any moment, could easily end your life.

The commercial opportunities and mounting war ignited innovation. Early on, flights across the US were nearly impossible even in clear conditions, and planes had top speeds of 100 mph. By the battles of WWII, planes were flying several hundred miles an hour, in any kind of weather, all over the world. All three men warned of the need for the United States to become a world leader in aeronautics. Stumping for building up the air power of the United States, amid fear of future wars, brought them political enemies. For me, it’s always interesting to read about isolationism and anti-war sentiment, especially around the World Wars. It’s arguably the most brushed-aside issue in any history class. Despite the successes of each man, crashes, political enemies, financial troubles, and the Lindbergh baby tribulations brought dark times. It is inspiring to see how each man deals with the hard times, overcomes them and ends up contributing so greatly to the US success in WWII.



Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation By Jez Humble

Hopefully, more books I read this year will have an impact on my day to day life, but at the very least this one definitely will. Continuous Delivery by Jez Humble was suggested to me by our VP of engineering along with Release It! by Michael Nygard, which I read last year. Both cover the software development pipeline by focusing on resiliency and regularly delivering working code. While many a blog will espouse the ideas of continuous delivery, this book gives you the patterns to use to significantly shorten your cycle time, the time between ideation and code live on production.


If you talk to people who are responsible for releasing software, most will tell you releasing is painful. The reasons for this sentiment are many: they do it rarely so it is a big event, it’s manual, there is no automated testing, they have no idea what they are releasing because someone else wrote it and, the kicker of them all, production is different from their testing environments. This book calls out all of these complaints and more as anti-patterns and models many of the solutions around one idea: if it is painful right now, then move it forward in your process and do it more often, which forces you to automate it. So if all you have is manual testing, which is slow and sits at the end of your process, then do test driven development, i.e. write tests first. He extends this TDD pattern to server deployments: you build the monitoring and health checks before you release the server, and they act as the “test” that passes once the server is deployed. Chances are that for many teams, especially ones with little automation in place, this means doing a lot of hard things first: automated testing, configuration management, better version control strategies, and getting staging to look more like production. Humble suggests incremental improvements, but it is critical to have testing in place; you can release all the time, but your customers won’t be too happy if your applications break regularly.


This book will change the way you deliver software and will likely make your life a lot easier. I also suggest reading Nygard’s Release It! either right before or right after, as it gives you concrete architecture patterns to implement some of Humble’s ideas.


The Almost Nearly Perfect People: Behind the Myth of the Scandinavian Utopia By Michael Booth

While coming back from our last trip to Iceland, I was walking around the airport bookstore and this book caught my eye. The book’s cover was tongue in cheek and I made a mental note to read it. I for one have definitely had the author’s experience of seeing the endless surveys and headlines about how great the Scandinavian countries are. Having visited Iceland a couple of times I couldn’t help but agree, but maybe there was something I had missed. In The Almost Nearly Perfect People, Michael Booth explores the 5 Nordic countries (Denmark, Iceland, Norway, Finland and Sweden) and tries to discern what makes them tick. He explores whether everything we read in the newspapers, such as the claim that they are the happiest people in the world, is actually true.


What is most striking about the book is he’s able to dig through some dark topics in each nation while still making the reader smile. It was fascinating to read about their past ties to Nazis and how some of their citizens have strong anti-immigration sentiment while others are pushing to expand their large welfare state and espouse their progressive attitudes. Clearly, as homogeneous as these nations appear they still have divergent opinions. Aside from the dark topics he also shows what works about these nations, how they are able to leverage their collectivist attitude to garner universal trust, self-control and openness and how they are so successful financially.


What If? By Randall Munroe

Are you a regular reader of XKCD? Do you like hypothesizing on the fantastical? Have you considered tying yourself to 100 AK-47s and trying to fly through the air? If you answered yes to any of the previous questions, then you are likely a candidate to read What If?: Serious Scientific Answers to Absurd Hypothetical Questions By Randall Munroe. In What If?, Munroe, the author of all of those amazing XKCD comics that we all quote day in and day out, attempts to answer his readers’ craziest questions. About half the content comes from Munroe’s What if? site, so even if you are a regular reader there is more than enough reason to pick it up. For me, the most interesting sections were where he avoided the obvious answers, like when answering the question “What if the sun disappeared?” Obviously, we would all freeze and die, but think of the upsides: no solar flares, better astronomy, no time zones! It is a great gift for the nerd in your life who wants to find out if the sun actually has ever set on the British Empire.
