Saturday, December 20, 2014

Maximizing Opening Weekend Sales for Wide-Release Horror Films


As scared as I get, I really do enjoy the experience of watching horror films.  But what exactly determines a horror film's success?  More specifically...

  • How can a horror film producer maximize a wide-release horror film's opening weekend sales?  
  • Does the horror film's budget matter?  
  • Which major studio distributor should the producer use?
  • What other factors (e.g. month of release, runtime, sub-genre, etc.) are at play?

I set out to answer these questions using BoxOfficeMojo data from all wide-release horror films with film budget information.  A wide-release film is released in at least 600 theaters, thereby factoring out all obscure horror films.

I built a regression model to predict opening weekend sales from the following features of each horror film:
  • number of theaters released (≥ 600 for wide-release)
  • budget for the film
  • years since 1975 (are opening weekend sales increasing or decreasing?)
  • runtime of movie (does film length matter?)
  • the rating of the movie (PG-13 or R rating)
  • horror subgenre (comedy, found footage, period, prequel, remake, scifi, serial killer, slasher, supernatural, terror in the water, torture, vampire, video game adaptation, & zombie) ... a movie can be of multiple subgenres
  • studio distributor (Buena Vista, Dimension, DreamWorks, Fox, Lionsgate, MGM, Miramax, Paramount, Relativity Media, Sony Entertainment, Universal Pictures, & Warner Brothers)
  • month of release

Ordinary least squares was used to estimate the parameters for the model below:


There was a significant (p<0.001) positive correlation of number of theaters released with opening weekend sales.  Specifically, the model predicts that one additional theater would result in a $15,560 ± $1,869 increase in opening weekend sales.  
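As a sketch of what such a model looks like, assuming a standard linear form with one coefficient per feature listed above (the symbol names here are illustrative and are not taken from the original equation):

```latex
\begin{aligned}
\text{sales} ={}& \beta_0
  + \beta_1\,\text{theaters}
  + \beta_2\,\text{budget}
  + \beta_3\,\text{years since 1975}
  + \beta_4\,\text{runtime}
  + \beta_5\,\text{rating} \\
 &+ \sum_j \gamma_j\,\text{subgenre}_j
  + \sum_k \delta_k\,\text{studio}_k
  + \sum_m \theta_m\,\text{month}_m
  + \varepsilon
\end{aligned}
```

Under this form, the theater coefficient reads as a per-theater change with all other features held fixed.  For example, with the reported estimate of $15,560 per theater, 100 additional theaters would correspond to a predicted increase of 100 × $15,560 = $1,556,000 in opening weekend sales.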

The figure below shows the relationship between opening weekend sales and the number of theaters released and illustrates: 
  • the positive correlation of opening weekend sales with the number of theaters released
  • the observed data compared with the model prediction (R-squared: 0.871).

R-squared goodness of fit value: 0.871

Interestingly, the only other significant effect found in the regression model was a positive effect (p=0.026) of the period horror subgenre on opening weekend sales.  Specifically, the model predicts that releasing a period horror film (i.e. a horror film set in a past time period) would result in a $7.66 ± $3.39 million increase in opening weekend sales.  Some examples of period horrors include Paranormal Activity 3, The Conjuring, Exorcist: The Beginning, Sleepy Hollow, and Shutter Island.

Why do period horror films appear to do better than the rest?  My hypothesis is that period horrors are usually based on something familiar, whether that is the time period or the story.  Due to the mere-exposure effect, individuals tend to show an affinity toward encountered items over unencountered ones.  This may result in a subconscious preference for watching familiar period horror films, especially for older generations.  In addition, the availability heuristic biases individuals to believe that more familiar events are more likely to occur.  This can intensify the creepiness of the film and increase viewership by horror film fans.

These features did not have a significant effect on opening weekend sales:
  • budget
  • years since 1975
  • runtime
  • rating
  • studio/distributor
  • month of release

Based on the findings, my advice for film producers wishing to maximize opening weekend sales for their wide-release horror film is to:
  • release in as many theaters as possible
  • produce a period horror thematically based on a legend, myth, and/or time period that most people are familiar with
  • not worry about the film's budget
  • not worry about the possible decrease in the number of movie-goers with the increased accessibility of streaming movie providers such as Netflix
  • not worry about the length of the film or the rating
  • not worry about which studio/distributor to use unless they can release in more theaters
  • not worry about releasing the film around Halloween...there is no evidence that it helps (possible reasons include students being in school, lack of proximity to holidays such as Thanksgiving or Christmas, and increased competition from other horror films)

Thank you

=]

Friday, November 28, 2014

The LA|NY State of Mind: LA Restaurants May Offer Better, More Memorable Service Compared to NYC Restaurants


Do LA restaurants generally offer better service than NYC restaurants?  As before, I answered this question using Yelp restaurant comments from the top 40 restaurants from 44 different categories of cuisine in LA and NYC.  To determine the relative impact of service, I first calculated the tf-idf of the word service in all comments from the top 40 restaurants in each of the 44 categories.  I then normalized the tf-idf of the word service as a percentage in relation to the highest tf-idf word from the same set of comments.  I only used 5-star and 1-star Yelp comments to remove any ambiguity in sentiment when comparing positive and negative service, respectively.  I then compared the average impact of positive and negative service across all 44 restaurant categories within each city as well as between cities.  

I found that within each city, there were no significant differences in positive versus negative service importance (LA: t(43)=0.722, p=0.472; NYC: t(43)=1.502, p=0.137).  Thus, in neither city did diners place more importance on positive service than on negative service, or vice versa.  There was also no significant difference between cities with regard to the average importance placed on negative service (t(43)=0.090, p=0.929).  With positive comments, however, LA diners placed significantly higher importance on positive service than their NYC counterparts (t(43)=2.272, p=0.026).
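The reported degrees of freedom, t(43) with 44 categories, are consistent with a paired t-test across categories.  Assuming that is the test used, here is a minimal sketch of how the statistic is computed.  The function name and inputs are illustrative, and the p-value lookup from the t distribution is left out.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired t-test on two equal-length samples.

    a[i] and b[i] are assumed to be measurements for the same category,
    e.g. normalized tf-idf of 'service' in 5-star vs 1-star comments.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

With 44 paired categories, the degrees of freedom come out to 44 - 1 = 43, matching the values quoted above.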


Negative vs Positive Service Importance Within Cities

N.S. = not significant (p≥0.05)

Negative vs Positive Service Importance Between Cities

* = significant (p<0.05)


For the top 40 restaurants among 44 popular categories of food, LA restaurant reviewers placed significantly more importance on positive service compared to NYC restaurant reviewers.  

I can't help but wonder:
  • Does LA's sunnier, warmer, and less rainy weather make diners and servers more positive?
  • Conversely, does NYC's more gloomy and cold weather make diners and servers less positive?
  • Do other positive aspects (taste of the food, plating, ambience, etc.) in NYC restaurants overshadow the importance of service?
Ultimately for top restaurants in each city, LA may offer more memorable positive service than NYC.  

Thanks again, 

Thursday, November 27, 2014

SERVING THE CITY OF ANGELS: How Important is 'Service' among LA Restaurant Diners?









Ahh the City of Angels (LA).  Like the previous post about serving the Big Apple, I will now discuss how important service is to different categories of restaurants in LA.  To address this, I used Yelp comment data (as of Nov 6th, 2014) from the top 40 restaurants from 44 different categories of cuisines.  I looked at high (5-star) and low (1-star) comments in order to discover the relative impact of positive and negative service in restaurant assessments.

To calculate the importance of service, I used a text mining method called term frequency-inverse document frequency (tf-idf) (see previous post on serving the Big Apple for details on this method).

Figures 1 and 2 below are some example word clouds that illustrate the relative importance (i.e. relative tf-idf) of the word service.  More specifically, each word cloud below shows the top 50 tf-idf words from positive (Figure 1) and negative (Figure 2) comments in a particular restaurant category, and sizes each word by its relative tf-idf value.



Figure 1
French Restaurant Category Word Cloud (Positive 5-Star Comments)
Relatively High Service Importance


Figure 2
Pizza Restaurant Category Word Cloud (Negative 1-Star Comments)
Relatively Low Service Importance


High Yelp Ratings 

Using only 5-star rated comments from the different categories of restaurants, I found the following (Figure 3) relative tf-idf values for the word service ordered by restaurant category from highest to lowest.


Figure 3


The categories with the most positive service comments were steak, French, Italian, and diner restaurants.  The restaurant categories with the least positive service comments (note that this does not mean negative, simply less impact of service) were food stands, fish n chips, pizza, and Filipino.


Low Yelp Ratings 


Using only 1-star rated comments from the different categories of restaurants, I found the following (Figure 4) relative tf-idf values for the word service ordered by restaurant category from highest to lowest.

Figure 4


The categories with the most negative service comments were from cafes, Korean, breakfast/brunch, and traditional American restaurants.  The restaurant categories with the least negative service comments were food stands, pizza, food courts, and fondue.

As was found in the Big Apple, a positive relationship was found between high and low service normalized tf-idf values (p<0.001).  Thus, the more importance LA restaurant diners placed on positive service, the more they placed on negative service as well (and vice versa).  Next I examined whether the price of the restaurant influenced how important service was for the diners.  To examine this possibility, I used Yelp's restaurant pricing system ($, $$, $$$, $$$$).  These dollar signs represent the cost per person for a meal including one drink, tax, and tip (see Serving the Big Apple post for details).

For each category of food, I calculated a PRICE SCORE to quantify the overall price of a particular restaurant category from its top 40 restaurants using the following equation:


PRICE SCORE = 
(# of $ restaurants) + 2*(# of $$ restaurants) + 3*(# of $$$ restaurants) + 4*(# of $$$$ restaurants)

In contrast to the Big Apple, I found a marginally non-significant (p=0.051) positive correlation between price score and positive service importance (from 5-star Yelp comments).  Although a negative correlation was found between price score and negative service importance (from 1-star Yelp comments), this correlation was not significant (p=0.658).

I deduce two salient possibilities from this data:

1.  The price of an LA restaurant does not necessarily equate to better service. 
2.  LA restaurant diners may generally place more importance on other factors (e.g. ambience, taste, plating).

Wednesday, November 12, 2014

SERVING THE BIG APPLE: How Important is 'Service' among NYC Restaurant Diners?


How important is service to different categories of restaurants in NYC?  To address this, I used Yelp comment data (as of Nov 6th, 2014) from the top 40 restaurants from different categories of cuisine (e.g. Traditional American, French, Mexican, burgers, etc.).  I looked at high (5-star) and low (1-star) comments in order to discover the relative impact of positive and negative service in restaurant assessments.

To calculate the importance of service, I used a text mining method called term frequency-inverse document frequency (tf-idf), which counts the number of times a word appears in all the comments and divides that number by the number of comments that contain at least one instance of that word.  This calculation allows us to quantify the importance of words while reducing the importance of less meaningful words that appear in almost all comments, such as 'a', 'the', and 'I'.  Once I found the tf-idf values of all the words, I normalized the tf-idf value of the word service relative to the highest tf-idf value (associated with the most important word) to obtain a measure of the relative importance of service in the users' dining experiences.  For example, if the highest tf-idf was 0.5 and service's tf-idf value was 0.2, service's normalized tf-idf would be 0.4 or 40%.
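The calculation described above can be sketched as follows.  This is a minimal illustration of the simple ratio as worded in this post (total occurrences divided by the number of comments containing the word), not the exact code behind the figures.  Many standard tf-idf formulations use a logarithmic inverse document frequency instead.  The tokenizer and function names are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(comment):
    """Lowercase a comment and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", comment.lower())


def simple_tfidf(comments):
    """Score each word as total occurrences / number of comments containing it."""
    term_counts = Counter()  # occurrences of each word across all comments
    doc_counts = Counter()   # number of comments that contain each word
    for comment in comments:
        tokens = tokenize(comment)
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    return {word: term_counts[word] / doc_counts[word] for word in term_counts}


def relative_importance(scores, word):
    """Normalize one word's score against the highest score in the set."""
    top = max(scores.values())
    return scores.get(word, 0.0) / top
```

For the worked example above, `relative_importance({"food": 0.5, "service": 0.2}, "service")` gives 0.4, i.e. 40%.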

Figures 1 and 2 below are some example word clouds that illustrate the relative importance (i.e. relative tf-idf) of the word service.  More specifically, each word cloud below shows the top 50 tf-idf words from positive (Figure 1) and negative (Figure 2) comments in a particular restaurant category, and sizes each word by its relative tf-idf value.

Figure 1
French Restaurant Category Word Cloud (Positive 5-Star Comments)
Relatively High Service Importance


Figure 2
Pizza Restaurant Category Word Cloud (Negative 1-Star Comments)
Relatively Low Service Importance

High Yelp Ratings 

Using only 5-star rated comments from the different categories of restaurants, I found the following (Figure 3) relative tf-idf values for the word service ordered by restaurant category from highest to lowest.


Figure 3

The categories with the most positive service comments were French, seafood, steak, and Italian restaurants.  The restaurant categories with the least positive service comments (note that this does not mean negative, simply less impact of service) were food courts, food stands, pizza, and hot dog.


Low Yelp Ratings 

Using only 1-star rated comments from the different categories of restaurants, I found the following (Figure 4) relative tf-idf values for the word service ordered by restaurant category from highest to lowest.

Figure 4


The categories with the most negative service comments were traditional American, French, German, and fondue restaurants.  The restaurant categories with the least negative service comments were food courts, food stands, pizza, and hot dog.


From Figures 3 and 4, the importance of service appears to be similar regardless of the valence of the comments for a particular restaurant category.  Statistically, a positive relationship was indeed found between high and low service normalized tf-idf values (p<0.001).  Simply put, NYC diners within a restaurant category placed roughly equal importance on service whether the experience was positive or negative.  But why would this be the case?  One possibility is that the price of the restaurant influenced how important service was for the diners.  To examine this possibility, I used Yelp's restaurant pricing system ($, $$, $$$, $$$$).  These dollar signs represent the cost per person for a meal including one drink, tax, and tip.

  • $ = under $10
  • $$ = $11-$30
  • $$$ = $31-$60
  • $$$$ = above $61
For each category of food, I calculated a PRICE SCORE to quantify the overall price of a particular restaurant category from its top 40 restaurants using the following equation:


PRICE SCORE = 
(# of $ restaurants) + 2*(# of $$ restaurants) + 3*(# of $$$ restaurants) + 4*(# of $$$$ restaurants)
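As a small sketch of this equation, assuming each restaurant's price tier is available as one of the strings `$`, `$$`, `$$$`, or `$$$$` (the function name and input format are illustrative):

```python
def price_score(price_tiers):
    """Sum the tier weights (1 to 4) over a category's restaurants.

    price_tiers: list of Yelp price strings, one per restaurant.
    """
    weights = {"$": 1, "$$": 2, "$$$": 3, "$$$$": 4}
    return sum(weights[tier] for tier in price_tiers)


# Hypothetical category: 10 x $, 20 x $$, 8 x $$$, 2 x $$$$ (40 restaurants).
example = ["$"] * 10 + ["$$"] * 20 + ["$$$"] * 8 + ["$$$$"] * 2
print(price_score(example))  # 10 + 40 + 24 + 8 = 82
```

With 40 restaurants per category, the score ranges from 40 (all `$`) to 160 (all `$$$$`).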

As one may expect, I found a significant (p=0.001) positive correlation between price score and positive service importance (from 5-star Yelp comments).  This means the more expensive the restaurant category, the better the service and/or the more importance Yelp diners placed on good service in their reviews.  Although a negative correlation was found between price score and negative service importance (from 1-star Yelp comments), this correlation was not significant (p=0.158).
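The post does not state which correlation coefficient was used.  Assuming a Pearson correlation between each category's price score and its normalized service tf-idf, the coefficient can be sketched like this (the function name is illustrative):

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Here `xs` would hold one price score per restaurant category and `ys` the matching service importance values.  A coefficient near +1 means the two rise together.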

Ultimately for diners in the Big Apple, the more you pay for your meal, the better the service and/or the more attention you pay to great service.  In addition, you also pay more attention to bad service.

Monday, November 3, 2014

NYC vs Los Angeles: which cuisines reign supreme?



New York City and Los Angeles are the two most populous US cities, and each has an overwhelming amount of diversity, especially in terms of ethnic restaurants.  Which city's cuisines reign supreme has been an ongoing debate.  I am here to offer statistical insight into this matter.  Keep in mind that this uses only restaurant ratings from Yelp (as of Nov 1, 2014).

Using the Yelp API, I compared the ratings of the top 40 rated restaurants in New York City and Los Angeles from 119 different restaurant categories.  Out of these categories I took only the categories that had at least 100 Yelp certified restaurants in each city to avoid low sampling biases and to ensure fair comparisons.  Here is what I found.



Among the top 40 restaurants in each city, New York City offered more highly rated French, Italian, Tex-Mex, and Thai restaurants as compared to Los Angeles.


*: p<0.05


Los Angeles, on the other hand, offered more highly rated BBQ, fast food, Korean, Mexican, and salad restaurants among its top 40 as compared to New York City.


*: p<0.05, **: p<0.01, ***: p<0.001

Note that all the findings I previously reported were statistically significant using an unpaired two-sample t-test, assuming unequal variances, with a p-value less than 5%.  The categories of restaurants that did not come out significantly different between cities were:

  • New American
  • Traditional American
  • Asian Fusion
  • Breakfast Brunch
  • Burgers
  • Cafes
  • Chicken Wings
  • Chinese
  • Delis
  • Diners
  • Hot Dog
  • Indian/Pakistani
  • Japanese
  • Latin
  • Mediterranean
  • Middle Eastern
  • Pizza
  • Sandwiches
  • Seafood
  • Steak
  • Sushi
  • Vegan
  • Vegetarian
  • Vietnamese
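An unpaired two-sample t-test with unequal variances is commonly known as Welch's t-test.  Below is a minimal sketch of its statistic and degrees of freedom.  The function name is illustrative, the ratings in the usage lines are made up for the example, and the p-value lookup is left out.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error of each sample mean.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical star ratings for one category in two cities.
city_a = [4.5, 4.5, 5.0, 4.0]
city_b = [4.0, 3.5, 4.0, 3.5]
t_stat, dof = welch_t(city_a, city_b)
```

In the comparison above, each sample would instead hold the ratings of a category's top 40 restaurants in one city.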

So there you have it, Yelp has spoken (as of November 1st, 2014), and we are left a little bit more insightful about what kinds of restaurants to choose in our next foray to these great cities.  

Thursday, October 30, 2014

Modeling & Predicting Coronary Heart Disease with Logistic Regression

Coronary (ischemic) heart disease results from plaque buildup in the arteries that supply blood and oxygen to the heart.  The narrowing of these arteries can culminate in a heart attack, and heart disease is one of the leading causes of death in men and women.  As a motivator to stay healthy, I believe people could benefit from a quantifiable way of measuring their relative risk of heart disease.  This is similar to the Framingham risk score, but was modeled from different datasets.  First, I set out to explore the data to gain insight into the important features to be used in the model.  Then I used cross-validation of different supervised machine-learning algorithms to build an optimized model.



Heart Disease Between Genders: Age and Cholesterol

Using IHIS data from 2000-2013, men aged 40+ were found to be at a significantly (t-test, p<0.05) increased risk for heart disease compared to women of the same age (see Figures 1 and 2).

Figure 1


Assuming cholesterol level has a positive relation to risk of heart disease, the increased risk of older men compared to older women does not appear to be the result of increased cholesterol levels in older men (see Figures 3 and 4). Women instead appear to have a statistically higher (p<0.05) cholesterol level between 40-42 years of age as well as above 58 years of age compared to their male counterparts. Unless cholesterol has a negative relation with heart disease, it appears that the risk from being an older man is largely independent of cholesterol.

Figure 3



Heart disease risk model

Using the previous insight of the significant combined effect of age and gender to model the UCI datasets, I built a logistic regression model that predicted an individual’s risk for heart disease (P(heartDisease)) using three highly significant (p<0.001) features:

1. age*gender
2. cholesterol level [cholesterol]
3. maximum heart rate achieved during exercise [max exercise HR]
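The general shape of a logistic model over these three features can be sketched as below.  The fitted coefficients are not given in this post, so the caller must supply them.  The coefficient key names and the 0/1 encoding of gender in the age*gender term are assumptions made for illustration.

```python
import math


def heart_disease_probability(age, is_male, cholesterol, max_exercise_hr, coef):
    """Logistic model sketch: P = 1 / (1 + exp(-z)).

    coef is a dict of hypothetical coefficients with the keys
    'intercept', 'age_gender', 'cholesterol' and 'max_exercise_hr'.
    Gender is assumed to be encoded as 1 for male and 0 for female.
    """
    gender = 1 if is_male else 0
    # Linear predictor built from the three features.
    z = (
        coef["intercept"]
        + coef["age_gender"] * age * gender
        + coef["cholesterol"] * cholesterol
        + coef["max_exercise_hr"] * max_exercise_hr
    )
    return 1.0 / (1.0 + math.exp(-z))
```

In a model of this form, a positive cholesterol coefficient raises the predicted probability as cholesterol increases, and a negative max exercise HR coefficient lowers it as the maximum heart rate increases.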


Of the models tested, the logistic model (see above) had the highest prediction accuracy (74%), with a precision and recall of 70% and 68%, respectively. More generally, this model predicts that being an older male, having high cholesterol, and achieving a low maximum heart rate during exercise increase the likelihood of heart disease.  As quantitative examples of this heart disease risk modeling, if a 42-year-old man who achieves 142 max beats per minute (bpm) during exercise reduces his cholesterol level from 250 mg/dL to 240 mg/dL (keeping all other features constant), he will have reduced his risk for heart disease by 1.3%. Compared to a man, a woman with these exact same stats will have a 22.4% reduced risk of heart disease. And finally, if this woman increases her max heart rate during exercise from 142 bpm to 152 bpm (keeping all other features constant), she will have reduced her risk for heart disease by 3.6%.  The following web application illustrates and quantifies this model (screenshot below).


Heart Disease and Menopause

I used 1994 and 1998 IHIS data to determine the relationship of a woman’s risk for heart disease with her menopausal status. As shown in Figures 5 and 6, I determined that 40-42 year old women with menopausal symptoms have a significantly (t-test, p<0.05) higher likelihood of heart disease compared to 40-42 year old women who have never experienced menopausal symptoms. Thus it appears that a younger woman’s likelihood of heart disease may be increased if she has experienced menopausal symptoms. Future work can predict a woman’s likelihood of having menopause (if status is unknown) from other key information such as smoking, diabetes status, age, and other factors. This can then be incorporated in the heart disease model to determine whether it can more accurately predict a woman’s risk of heart disease.


Summary

Heart disease between genders: age and cholesterol
Men aged 40 and up are at an increased risk for heart disease compared to women of the same age. This statistically significant effect does not appear to be the result of increased cholesterol levels.

Heart disease risk model: effects of gender, age, cholesterol, and max heart rate during exercise
I built a model that predicted an individual’s risk for heart disease based on his/her gender and age, cholesterol level, and maximum heart rate attained during exercise. Specifically the model predicted that being older, male, having high cholesterol, and reaching a low maximum heart rate during exercise increased the likelihood of heart disease.

Heart Disease and Menopause
40-42 year old women with menopausal symptoms have a significantly higher likelihood of heart disease compared to 40-42 year old women who have never experienced menopausal symptoms. Future work can predict a woman’s likelihood of having menopause (if not available) from other key information to see if it can better predict a woman’s risk of heart disease.

Data sources

IHIS
UCI Heart Disease Data 1, Data 2

Tuesday, September 30, 2014

How the Affect Heuristic Can Influence Consumer Behavior Toward Packaged Products


When perusing products in an aisle of a store, we quickly form positive or negative impressions of the value of some products even without all the necessary information (e.g. price, number of items in the package, etc.). This is because our brains have evolved mental approximations such as the affect heuristic (Slovic, Finucane, Peters, & MacGregor, 2007), allowing us to evaluate environmental stimuli rapidly. More specifically, the affect heuristic is an immediate evaluation of the positive or negative valence of stimuli on attributes that are easier to determine. However, this also means that attributes that are more difficult to discern are ignored during this evaluation process. Because the affect heuristic happens so automatically and can save us time on shopping trips, it can have a large unconscious impact on what we choose to purchase.

Studies with people and monkeys show that quality is an attribute that is easier to discern than quantity. In people (Hsee, 1998), participants in a study rated a 24-piece dinnerware set to be of higher value as compared with the same 24-piece set with an additional sixteen pieces of which nine were broken (thus, seven additional pieces intact). Although the latter option quantitatively offered more intact pieces, its evaluation was reduced by the additional items in poor qualitative condition. On the other hand, when the same items were juxtaposed as a choice so that they could be directly compared, the same participants now chose the 40-piece dinnerware set that included broken pieces over the 24-piece set that they originally rated as higher in value.
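The arithmetic behind this reversal can be made explicit with a small sketch.  The piece counts come from the description above, and the function name is illustrative.  One set wins on the proportion of intact pieces, the other on the number of intact pieces.

```python
def summarize(total_pieces, broken_pieces):
    """Return (intact count, proportion intact) for a dinnerware set."""
    intact = total_pieces - broken_pieces
    return intact, intact / total_pieces


set_24 = summarize(24, 0)  # 24 intact pieces, all of the set intact
set_40 = summarize(40, 9)  # 31 intact pieces, 77.5% of the set intact

# Judged on proportion intact, the 24-piece set looks better.
better_on_quality = "24-piece" if set_24[1] > set_40[1] else "40-piece"
# Judged on the number of intact pieces, the 40-piece set offers more.
better_on_quantity = "24-piece" if set_24[0] > set_40[0] else "40-piece"
```

Evaluated alone, a set tends to be judged on the first measure.  Placed side by side, the second measure becomes easy to compare.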

Initially, participants preferred Set L (the 24-piece set) over Set H (the 40-piece set with broken pieces).  When juxtaposed as a choice, the participants now chose Set H over Set L.


In a similar vein, rhesus monkeys (Kralik, Xu, Knight, Khan, & Levine, 2012) preferred a highly-valued food item alone compared to the same highly-valued food item paired with an additional food item of positive but lower value. Thus, the monkeys were evaluating the choice options based on overall quality while neglecting overall quantity. Given repeated trials, however, the monkeys were no longer biased toward the highly-valued food item in isolation. Therefore, the monkeys’ choices were influenced by experience with the choice options, such that experience allowed the monkeys to consider more attributes, i.e. the quantity and overall value of the obtained food items instead of the average quality of the items alone. These results suggest that the affect heuristic, which may have been conserved through evolution, favored quality over quantity but can yet be overridden during joint evaluations and experience.

How could this affect heuristic affect consumer behavior? Because it appears that we are drawn to quality over quantity in our evaluations and choices, shoppers may unconsciously pay a premium for single or packaged products with a larger proportion of their most desired items over more mixed packages with a larger number of items that may actually offer a better overall deal. To prevent this, shoppers can try placing packaged products side by side to avoid the biasing effects of affect heuristic memories.  This will help consumers more objectively take into account important features such as quantity in shopping decisions.



A 6-pack of Cheetos.

A 50-pack of assorted chips, including 10 packs of Cheetos.

According to the affect heuristic, a Cheetos lover will most likely value this significantly less (per pack) than the 6-pack of Cheetos without other flavors, even if the other flavors are not aversive.