# A/B TEST SIMULATION IN R


last hacked on Jan 27, 2019

We did this project to learn about A/B testing and to be friends on a Sunday afternoon.

---

From [Wikipedia](https://en.wikipedia.org/wiki/A/B_testing):

> A/B testing (or split-testing) is a randomized experiment with two variants, `A` and `B`. It includes application of statistical hypothesis testing (or two-sample hypothesis testing), as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

A/B testing can be powerful for determining whether (given a specific success metric) it's worth adding a product feature, implementing a workflow, among other things.

### Steps:

• Decide what you're testing (develop null and alternative hypotheses)
• Pick a success metric
• Calculate the required sample size
• Decide how long to run the test
• While the test is running, watch out for problems
• Analyze the results (with an appropriate statistical test)

### What are we testing?

• `Ho` : Adding a satisfaction question to each item has no effect on the completion rate of the test.
• `H1` : Adding a satisfaction question to each item has an effect on the completion rate of the test.

## Implementation

```r
# install.packages("dplyr")
# install.packages("tibble")
library(dplyr)
library(tibble)
```

### Set seed for experiment replicability

```r
# set seed for experiment replicability
set.seed(100)
```

### Sample size calculation

When we do a sample size calculation (with `power.prop.test()`), we input:

• The anticipated proportion in the control group (`p1`)
• The anticipated proportion in the experiment group (`p2`)
• The desired power (`power`): the probability that the test will reject a false null hypothesis
• The significance level (`sig.level`), which defaults to `0.05`
```r
# calculate our sample size in group_a and group_b
power.prop.test(p1 = 0.7, p2 = 0.75, power = 0.8)
```
```
     Two-sample comparison of proportions power calculation 

              n = 1250.717
             p1 = 0.7
             p2 = 0.75
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
```

From this we learn that each group's sample needs to contain at least 1251 users.
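As a side note, `power.prop.test()` also makes it easy to see how quickly the required sample size grows as the detectable effect shrinks. A small sketch (the alternative `p2` values of 0.80 and 0.72 are hypothetical, just for illustration), keeping `p1` fixed at 0.70:

```r
# required n per group for a few hypothetical treatment proportions,
# holding p1 = 0.70, power = 0.8, and the default sig.level = 0.05
for (p2 in c(0.80, 0.75, 0.72)) {
  n <- ceiling(power.prop.test(p1 = 0.7, p2 = p2, power = 0.8)$n)
  cat(sprintf("p2 = %.2f -> n per group = %d\n", p2, n))
}
```

Halving the lift from 10 points to 5 points roughly quadruples the sample needed per group.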

### Mocking data for control group

```r
# mocking data for group_a: control group
group_a <- tibble(
  user_id = seq(1, 3000, by = 2),
  test_group = rep("a", 1500),
  completed_assessment = sample(
    c("completed", "not_completed"),
    1500,
    replace = TRUE,
    prob = c(0.7, 0.3)
  ),
  overall_score = rnorm(1500, mean = 100, sd = 15)
)
```

### Mocking data for treatment group

```r
# mocking data for group_b: treatment group
group_b <- tibble(
  user_id = seq(2, 3000, by = 2),
  test_group = rep("b", 1500),
  completed_assessment = sample(
    c("completed", "not_completed"),
    1500,
    replace = TRUE,
    prob = c(0.65, 0.35)
  ),
  overall_score = rnorm(1500, mean = 98, sd = 12)
)
```
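A quick sanity check on the mocking approach: the empirical completion share should land near the `prob` value passed to `sample()`, give or take sampling noise. A minimal, self-contained sketch that rebuilds only the completion column of the control group under the same seed:

```r
library(tibble)
set.seed(100)

# rebuild just the completion column of group_a
group_a <- tibble(
  completed_assessment = sample(
    c("completed", "not_completed"), 1500,
    replace = TRUE, prob = c(0.7, 0.3)
  )
)

# empirical completion share; should sit near the 0.70 passed to sample()
mean(group_a$completed_assessment == "completed")
```

With 1500 draws, the binomial standard error is about `sqrt(0.7 * 0.3 / 1500) ≈ 0.012`, so shares a point or two away from 0.70 are expected.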

### Exploring the control group data

```r
glimpse(group_a)
```
```
Observations: 1,500
Variables: 4
$ user_id              <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57,...
$ test_group           <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", ...
$ completed_assessment <chr> "completed", "completed", "completed", "completed", "completed", "completed", "not_completed", "completed", "c...
$ overall_score        <dbl> 99.16123, 94.41932, 107.11450, 99.07443, 68.08466, 84.96644, 92.23293, 133.15315, 104.87477, 80.84787, 131.179...
```
```r
group_a %>%
  count(completed_assessment)
```
```
# A tibble: 2 x 2
  completed_assessment     n
  <chr>                <int>
1 completed             1024
2 not_completed          476
```

### Exploring the experiment group data

```r
glimpse(group_b)
```
```
Observations: 1,500
Variables: 4
$ user_id              <dbl> 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58...
$ test_group           <chr> "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", ...
$ completed_assessment <chr> "not_completed", "completed", "completed", "completed", "completed", "completed", "completed", "completed", "c...
$ overall_score        <dbl> 97.10005, 86.63501, 98.00451, 79.70051, 99.18022, 79.68071, 82.09675, 105.24031, 96.82983, 96.36051, 109.07248...
```
```r
group_b %>%
  count(completed_assessment)
```
```
# A tibble: 2 x 2
  completed_assessment     n
  <chr>                <int>
1 completed              974
2 not_completed          526
```

### Joining our control and experiment groups

```r
group_a_b <- bind_rows(group_a, group_b)
```

### Arranging data by `user_id`

```r
arranged_group_a_b <- group_a_b %>%
  arrange(user_id)
```
```
# A tibble: 3,000 x 4
   user_id test_group completed_assessment overall_score
     <dbl> <chr>      <chr>                        <dbl>
 1       1 a          completed                     99.2
 2       2 b          not_completed                 97.1
 3       3 a          completed                     94.4
 4       4 b          completed                     86.6
 5       5 a          completed                    107. 
 6       6 b          completed                     98.0
 7       7 a          completed                     99.1
 8       8 b          completed                     79.7
 9       9 a          completed                     68.1
10      10 b          completed                     99.2
# ... with 2,990 more rows
```

### Generating table of completions by test group

```r
prop_table <- table(
  arranged_group_a_b$test_group,
  arranged_group_a_b$completed_assessment
)
```
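Before running the test, it can help to eyeball the table `prop.test()` will receive: rows are the test groups, columns are the outcomes, and the first column is treated as the "success" count. A minimal sketch, rebuilding the table from the `count()` tallies reported above:

```r
# counts taken from the count() tallies above
prop_table <- as.table(rbind(
  a = c(completed = 1024, not_completed = 476),
  b = c(completed = 974,  not_completed = 526)
))

# addmargins() appends row/column sums; each row should total 1500
addmargins(prop_table)
```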

### Two-sample proportion test

```r
prop.test(prop_table)
```
```
    2-sample test for equality of proportions with continuity correction

data:  prop_table
X-squared = 3.5979, df = 1, p-value = 0.05785
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.00106645  0.06773312
sample estimates:
   prop 1    prop 2 
0.6826667 0.6493333 
```

With a `p-value` of `0.05785`, we do not have enough evidence to reject the null hypothesis (`Ho`): we reject only when `p-value < 0.05`, and `0.05785` just misses that cutoff.
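The same result can be reproduced straight from the completion counts, without building the intermediate table, since `prop.test()` also accepts a vector of successes `x` and a vector of trials `n`:

```r
# completions and group sizes from the count() tallies above
res <- prop.test(x = c(1024, 974), n = c(1500, 1500))
res$p.value  # ~0.058, matching the table-based call
```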

# `script.R` (all code):

```r
# install.packages("tidyverse")
library(tibble)
library(dplyr)

# set seed for experiment replicability
set.seed(100)

# calculate our sample size in group_a and group_b
power.prop.test(p1 = 0.7, p2 = 0.75, power = 0.8)

# mocking data for group_a: control group
group_a <- tibble(
  user_id = seq(1, 3000, by = 2),
  test_group = rep("a", 1500),
  completed_assessment = sample(
    c("completed", "not_completed"),
    1500,
    replace = TRUE,
    prob = c(0.7, 0.3)
  ),
  overall_score = rnorm(1500, mean = 100, sd = 15)
)

# mocking data for group_b: treatment group
group_b <- tibble(
  user_id = seq(2, 3000, by = 2),
  test_group = rep("b", 1500),
  completed_assessment = sample(
    c("completed", "not_completed"),
    1500,
    replace = TRUE,
    prob = c(0.65, 0.35)
  ),
  overall_score = rnorm(1500, mean = 98, sd = 12)
)

# exploring group_a: the control
glimpse(group_a)
group_a %>%
  count(completed_assessment)

# exploring group_b: the experiment
glimpse(group_b)
group_b %>%
  count(completed_assessment)

# joining our control and experiment groups
group_a_b <- bind_rows(group_a, group_b)

# arranging data by user_id
arranged_group_a_b <- group_a_b %>%
  arrange(user_id)

# preview of new working dataset
arranged_group_a_b

# generating table of completions by test group
prop_table <- table(
  arranged_group_a_b$test_group,
  arranged_group_a_b$completed_assessment
)

# two-sample proportion test
prop.test(prop_table)
```