Use dummy variables to make multiple regression analysis more flexible!


Multiple regression analysis works with quantitative data such as numbers, but non-numeric information can also be converted into numbers and incorporated into the analysis. Making good use of "dummy variables" expands the range of factors that a multiple regression analysis can take into account.

In this article, we will introduce how to create dummy variables, examples of how to use them, and points to keep in mind when actually conducting analysis.

What are dummy variables?: How to express non-numeric information using "0" and "1"

A dummy variable is a method for converting non-numeric data into numbers. Specifically, it converts non-numeric data into a sequence of only "0" and "1".

For example, to see the impact of a consumption tax hike on the economy, you can set the period before the tax hike to "0" and the period after the tax hike to "1," which allows you to take into account the changes caused by the tax hike.
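A before/after dummy like this can be sketched in a few lines of Python. The cutoff date and the list of months below are illustrative assumptions, not data from an actual analysis.

```python
from datetime import date

# Assumed cutoff for illustration: the day the tax hike took effect.
TAX_HIKE_DATE = date(2019, 10, 1)

def tax_hike_dummy(day: date) -> int:
    """Return 1 on or after the tax hike, 0 before it."""
    return 1 if day >= TAX_HIKE_DATE else 0

# Hypothetical monthly observations around the cutoff.
months = [date(2019, 8, 1), date(2019, 9, 1), date(2019, 10, 1), date(2019, 11, 1)]
dummy = [tax_hike_dummy(d) for d in months]
print(dummy)  # [0, 0, 1, 1]
```

The resulting 0/1 sequence is then added to the regression as one more explanatory variable.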

How to create dummy variables: Explanation divided into two cases

In practice, there are two main cases when creating dummy variables: a binary choice (e.g. included/not included) or a choice among three or more options (e.g. day of the week).

When creating dichotomous dummy variables

The data is created by converting one of the two options to "0" and the other to "1".

  • Yes → 1, No → 0
  • Included → 1, Not Included → 0
  • Male → 1, Female → 0

It is easier to keep track of dummy variables if you give each one a name. In econometrics, names of the form "XX dummy" are often used.

When creating dummy variables with three or more choices

In this case, you create one dummy variable for each category. For example, if you want to use the day of the week as dummy variables:

  • Monday dummy: A sequence of numbers with Monday as 1 and other days as 0
  • Tuesday dummy: A sequence of numbers with Tuesday as 1 and other days as 0
  • Continue in the same way through the Sunday dummy, for a total of 7 dummies.

Similarly, if you want to see the impact of three different campaigns (A/B/C) that were run separately over different periods:

  • Campaign A dummy: A sequence with 1 for periods when A is being conducted and 0 for other periods
  • Create the Campaign B dummy and Campaign C dummy in the same way.

When actually running the analysis, use one fewer dummy than the number of categories. For the days of the week, that means 6 of the 7 dummies. The omitted category serves as the baseline against which the others are compared.

(We won't go into details here, but including all of them at once makes the dummies perfectly collinear with the intercept term, which makes the analysis results extremely unreliable.)
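The "one fewer than the number of categories" rule can be illustrated with a small sketch. The day labels, the choice of Sunday as the omitted baseline, and the observed days are all assumptions made for the example.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_dummies(day, baseline="Sun"):
    """Return k-1 dummies for one observation, omitting the baseline day."""
    return {f"{d}_dummy": int(day == d) for d in DAYS if d != baseline}

# Hypothetical observed days.
observed = ["Mon", "Fri", "Sun", "Fri"]
rows = [day_dummies(d) for d in observed]

# With all 7 dummies, every row sums to exactly 1, which duplicates the
# intercept column of the regression. That is the source of the problem.
all_seven = [[int(d == day) for d in DAYS] for day in observed]
row_sums = [sum(r) for r in all_seven]

print(rows[2])    # the baseline day (Sunday) is all zeros
print(row_sums)   # [1, 1, 1, 1]
```

With Sunday omitted, each remaining coefficient is read as the difference from Sunday.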

Example of application of multiple regression analysis using dummy variables

Dummy variables are very simple to create, but with some ingenuity they can be used in a variety of analyses. Below, we will introduce some specific analysis examples.

Case 1: Analyzing the effectiveness of "day of the week x flyer distribution" at an izakaya

By using the dummies for the days of the week mentioned earlier, you can see how much of an impact the "day of the week" has. For example, let's say you're running an izakaya and you want to know how much "the number of flyers distributed in front of the store" leads to "an increase in the number of customers."

However, since the number of customers at izakayas increases on Fridays anyway, when flyers are distributed on a Friday it is unclear how much of the increase is actually due to the flyers.

In such cases, we use a dummy variable called "Friday" in our analysis. In other words, we analyze whether there are more customers because it is "Friday" or because "flyers were distributed," or both. The results of this analysis are shown in the figure below.

The analysis uses MAGELLAN, a marketing mix modeling analysis service.

From this, we can see the relationships that "distributing one flyer increases the number of customers by 1 person" and "the number of customers increases by 0.06 people on the day before a holiday." If we instead analyzed the relationship based only on the "number of flyers distributed," we would see that "distributing one flyer increases the number of customers by 25 people," but the fit of that analysis is very poor.
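To make the comparison concrete, here is a minimal sketch of the same idea using ordinary least squares written in plain Python. It is not how MAGELLAN works internally. The data is synthetic and noise-free, generated from assumed coefficients (20 base customers, 0.5 per flyer, 15 extra on Fridays) purely for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(X, y, beta):
    """Coefficient of determination for fitted coefficients beta."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
                 for r, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic, noise-free data (assumed values, not real store data).
flyers = [10, 40, 20, 30, 50, 15, 35, 25, 45, 20]
friday = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
customers = [20 + 0.5 * f + 15 * d for f, d in zip(flyers, friday)]

X_full = [[1.0, f, d] for f, d in zip(flyers, friday)]  # intercept, flyers, Friday dummy
X_flyers = [[1.0, f] for f in flyers]                   # intercept, flyers only

beta_full = ols(X_full, customers)
beta_flyers = ols(X_flyers, customers)
r2_full = r_squared(X_full, customers, beta_full)
r2_flyers = r_squared(X_flyers, customers, beta_flyers)

print("with Friday dummy:", beta_full, r2_full)
print("flyers only:      ", beta_flyers, r2_flyers)
```

In this toy data, the model with the Friday dummy recovers the assumed coefficients, while the flyers-only model credits part of the Friday effect to the flyers and fits worse.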

Case 2: Analyzing conversions by "time of day" for email newsletter delivery

Next, let's consider bringing "time of day" into an analysis of "factors that affect the conversion rate of email newsletters." Conversion rates are naturally influenced by various factors, but let's say you have a hypothesis that "the response will be better if you send it in the evening." In that case, you can conduct an analysis that also takes the time of sending into account. Specifically, you would create dummy variables that divide the day into periods, such as a 9:00-12:00 dummy, a 12:00-15:00 dummy, and a 15:00-18:00 dummy.

However, are dummy variables divided into three-hour periods like this optimal? In fact, this is where the ingenuity of using dummy variables lies. Rather than mechanically dividing the day into three-hour blocks, setting three dummy variables such as a "work hours dummy / lunch break dummy / home time dummy" may represent the movement of the data more accurately. How you carve up reality in this way is exactly where the analyst's "ability to hypothesize" is tested.
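Hypothesis-driven time slots of this kind might be encoded as follows. The slot names follow the example above, but the hour boundaries are illustrative assumptions and should be replaced with whatever your own hypothesis suggests.

```python
# Assumed slot boundaries on a 24-hour clock, as [start, end) ranges.
SLOTS = {
    "work_hours_dummy": [(9, 12), (13, 18)],
    "lunch_break_dummy": [(12, 13)],
    "home_time_dummy": [(18, 22)],
}

def slot_dummies(send_hour):
    """Return one 0/1 dummy per slot for the hour an email was sent."""
    return {
        name: int(any(start <= send_hour < end for start, end in ranges))
        for name, ranges in SLOTS.items()
    }

print(slot_dummies(12))  # lunch break
print(slot_dummies(20))  # home time
print(slot_dummies(7))   # outside every slot: all zeros (the baseline)
```

Sends that fall outside every slot get all zeros and act as the baseline, so the three dummies can be used together without the collinearity problem described earlier.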

Case 3: Eliminate "unknown exceptions"

Finally, dummy variables can also be created after the fact for cases where "something is behaving in a special way for reasons that are unclear."

This is probably the most practical method. Unlike the previous two examples, where the analysis was designed around a hypothesis, real data often contains points whose behavior we cannot explain at all.

It could be a data collection error, or it could just be that a rare event happened on that day (such as a sharp drop in sales in a week when two typhoons hit). The cause is not clear at that point.

Therefore, by creating a dummy variable set to "1" only for the data points judged to be "exceptional," it is possible to incorporate that peculiarity into the analysis. This makes it possible to handle both the overall trend and any localized exceptions in a single analysis.

To give a concrete example, let's consider data on "ice cream sales" for August analyzed based on "maximum temperature," "rainfall," and "number of pedestrians."

When analyzing this, we can see that for some reason sales were poor on August 8, but increased dramatically around August 27. If we isolate these two "exceptions" with dummy variables, the accuracy of the analysis becomes much higher, and we will be able to predict future sales with a high degree of accuracy.
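Flagging such exceptions after the fact could look like the sketch below. The year and the two flagged dates are assumptions chosen for illustration. In practice you would pick the dates by inspecting the data or the residuals of a first regression.

```python
from datetime import date, timedelta

YEAR = 2024  # hypothetical year, for illustration only
august = [date(YEAR, 8, 1) + timedelta(days=i) for i in range(31)]

# Hypothetical dates judged "exceptional" after looking at the data.
EXCEPTIONS = {
    "aug08_dummy": {date(YEAR, 8, 8)},
    "aug27_dummy": {date(YEAR, 8, 27)},
}

# One row per day, with a 0/1 column for each exception.
rows = [
    {"date": d, **{name: int(d in days) for name, days in EXCEPTIONS.items()}}
    for d in august
]

flagged = [r["date"].day for r in rows if r["aug08_dummy"] or r["aug27_dummy"]]
print(flagged)  # [8, 27]
```

Each exception gets its own dummy here so that the poor day and the strong day are estimated separately rather than forced to share one coefficient.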

Summary: Using dummy variables expands the possibilities of multiple regression analysis

Dummy variables may seem simple at first glance, but how you use them can greatly affect the accuracy and insight of your analysis.

In fields such as marketing that are easily influenced by "human behavior" and "environmental changes," it is important to make good use of dummy variables. Doing so leads to compelling analysis and more confident decision-making.
