Hanukkah of Data 5784: R and Julia

This post documents my experience using R and Julia to solve this year’s Hanukkah of Data challenges.

R
Julia
Hanukkah-of-Data
Author
Published

December 28, 2023

People always say that it must be nice that Hanukkah has 8 days of presents. This year, with everyone doing Advent of Code for 3+ weeks, I’m grateful that the Hanukkah of Data challenge only has eight days.

The Hanukkah of Data is a fun puzzle that gives folks a chance to practice coding and data analysis skills on a fun, low-stakes group challenge. I’m using this as an opportunity to explore the use of Julia and Python. I still find R to be a much more intuitive data analysis tool, but I want to stretch my legs and build my skills with these other languages, too. So this blog post documents the approach I used to solve these puzzles with Julia and R. This post focuses more on the programming than on the problem-solving; if you’re looking for a good post that talks about the reasoning for each night’s puzzle, I’d highly recommend this post from Ella Kaye: https://ellakaye.co.uk/posts/2023-12-07_hanukkah-of-data-5784/.

In general, I found that Julia syntax for data manipulation was less readable than the Tidyverse syntax that’s become popular in R. The Julia data manipulation syntax for the foundational package DataFrames often feels less like a user-centered design and more like an intellectual exercise for the designers. The extension package DataFramesMeta makes things much more readable, but there’s still a lot more cruft than in the tidyverse: for example, each pipe involves more annoying bookkeeping syntax like @chain mydata begin ... end. So my main takeaway of the foundational Julia data manipulation packages is that recommending it to the average data analyst would be like recommending a custom Linux distribution to your grandmother. The other big challenge I found with using Julia was that the error messages were often quite difficult to understand: the stacktraces are usually quite long and heavily laden with jargon, so that they’re helpful for package developers but inscrutable to a typical data analyst.

With that said, let’s solve these puzzles!

Setup: Configuring a Project Folder and Installing Packages

As a little bit of housekeeping, I set up a project folder. To keep track of the Julia packages and their versions, I created a TOML file for the project. In the Julia REPL, I opened the package manager by typing ]; then, from within the package manager, I created a TOML file by simply running activate ..

Then I added a few Julia packages:

```{julia}
#| label: julia-pkg-install
#| output: false
#| eval: false
#| code-overflow: wrap
 import Pkg
 Pkg.add(["StatsBase", "DataFrames", "DataFramesMeta", "Tidier", "TidierStrings", "CSV", "Pipe", "Chain"])
```

Here’s what this looked like as I waited for these to install.

Precompiling project...
  Progress [================================>        ]  123/155
  ◐ TiffImages
  ◓ ImageCore
  ✓ Distributions → DistributionsChainRulesCoreExt
  ◐ Polynomials → PolynomialsMakieCoreExt
  ◐ Polynomials → PolynomialsChainRulesCoreExt

Maaaany dependency installations and roughly 15 minutes later, I have all of these packages installed and ready to use. With that out of the way, I closed the package manager by typing the keyboard shortcut “CTRL+C”. Then I loaded these packages like so.

using DataFrames, DataFramesMeta, CSV, Chain
import Tidier as TD
import Dates as DATE

Next, I downloaded the data from the Hanukkah of Data site. I unzipped the four CSV files into a subdirectory of the project named data. Here’s the list of files in the data folder:

readdir("data")
4-element Vector{String}:
 "noahs-customers.csv"
 "noahs-orders.csv"
 "noahs-orders_items.csv"
 "noahs-products.csv"

Let’s go ahead and load all of the datasets.

customers    = DataFrame(CSV.File("data/noahs-customers.csv"))
orders       = DataFrame(CSV.File("data/noahs-orders.csv"))
orders_items = DataFrame(CSV.File("data/noahs-orders_items.csv"))
products     = DataFrame(CSV.File("data/noahs-products.csv"))
library(tidyverse)
customers    <- read_csv("data/noahs-customers.csv")
orders       <- read_csv("data/noahs-orders.csv")
orders_items <- read_csv("data/noahs-orders_items.csv")
products     <- read_csv("data/noahs-products.csv")

First Night of Hanukkah

Julia Solution

Here’s the setup for Night 1.

We know that we need to identify a customer whose last name can be spelled out by a phone’s keypad. So we’ll create two helper functions: the first will extract a person’s surname, and the second will translate a given surname into a phone number.

# Define function to extract the surname from the full name
  function get_surname(full_name)
    regex_match = match(r"[A-z]+$", full_name)
    if isnothing(regex_match)
      ""
    else
      uppercase(regex_match.match)
    end
  end
get_surname (generic function with 1 method)

# Define a function to convert a name to a phone number

  # First create a dictionary giving the letters included on each key in a phone keypad
  phone_keypad = Dict([               ('2', "ABC"), ('3',  "DEF"), 
                       ('4',  "GHI"), ('5', "JKL"), ('6',  "MNO"), 
                       ('7', "PQRS"), ('8', "TUV"), ('9', "WXYZ")])
Dict{Char, String} with 8 entries:
  '2' => "ABC"
  '3' => "DEF"
  '4' => "GHI"
  '7' => "PQRS"
  '6' => "MNO"
  '9' => "WXYZ"
  '8' => "TUV"
  '5' => "JKL"

  # Initialize the dictionary we're really interested in
  phone_keypad_lookup = Dict{Char, Char}()
Dict{Char, Char}()

  # Assemble the dictionary
  for key in keys(phone_keypad)
    for letter in phone_keypad[key]
      phone_keypad_lookup[letter] = key
    end
  end

  function name_to_phone_number(surname)
    reduce(string, [phone_keypad_lookup[letter] for letter in uppercase(surname)]; init = "")
  end
name_to_phone_number (generic function with 1 method)

Now we’ll use these functions to filter the customers data to just customers whose names can be spelled by their phone number. Note that I’ve loaded the Tidier package using an import statement. This is because the Tidier package includes macros whose names coincide with macros from another package I’ve loaded (DataFramesMeta), and I want it to make sure that I always clearly refer to Tidier package macros with the prefix TD..

import Tidier as TD
import TidierStrings as TDS
# Identify the person whose last name matches their phone number
@chain customers begin
  @TD.mutate(phone_number_digits = TDS.str_remove_all(phone, "-"),
             surname             = get_surname(name))
  @TD.mutate(name_as_number      = name_to_phone_number(surname))
  @TD.filter(phone_number_digits == name_as_number)
  @TD.select(name, surname, phone)
end
1×3 DataFrame
 Row │ name            surname     phone
     │ String31        String      String15
─────┼──────────────────────────────────────────
   1 │ Sam Tannenbaum  TANNENBAUM  826-636-2286

R Solution

For comparison, here’s my R solution:

# Create a lookup key for letters to numbers
  keypad <- list(
                                 `2` = c("A", "B", "C"), `3` =      c("D", "E", "F"), 
    `4` = c("G", "H", "I"),      `5` = c("J", "K", "L"), `6` =      c("M", "N", "O"), 
    `7` = c("P", "Q", "R", "S"), `8` = c("T", "U", "V"), `9` = c("W", "X", "Y", "Z")
  )
  
  keypad_letter_lookup <- vector()
  for (key in names(keypad)) {
    for (letter in keypad[[key]]) {
      keypad_letter_lookup[letter] <- key
    }
  }
  
# Create a function to convert names to numbers
  name_to_number <- function(name) {
    
    letters_of_names <- name |> toupper() |> str_split(pattern = "")
    
    map_chr(letters_of_names, function(letters) {
      keypad_letter_lookup[letters] |> paste(collapse = "")
    })
    
  }
  
# Identify the person of interest
  
  customers |>
    mutate(
      surname           = str_extract(name, "[A-z]+$"),
      surname_as_number = name_to_number(surname)
    ) |>
    filter(
      surname_as_number == str_remove_all(phone, "-")
    ) |>
    select(name, surname, phone)
# A tibble: 1 × 3
  name           surname    phone       
  <chr>          <chr>      <chr>       
1 Sam Tannenbaum Tannenbaum 826-636-2286

Miscellaneous Observations

The biggest difference between these R and Julia solutions is that for R we had to split the names into their component characters, whereas in Julia, strings act like arrays of characters and so they don’t need to be split in order to iterate over the characters. This feels very similar to Python and is a nice feature. Of course, the data manipulation at the end is almost identical to the equivalent R code, since I used the Tidier package (which is a partial copy of the Tidyverse). That’s nice for useRs like me, although it’s troubling that the Julia package completely borrowed the user interface from the R packages without crediting the R package authors as co-authors of the user interface. I’ll keep an eye on the Tidier package, since it’s a promising tool for writing reader-friendly data manipulation code in Julia.

Second Night of Hanukkah

Here’s the setup for Night 2.

For the second night of Hanukkah, instead of relying on Tidier to provide a familiar user interface, I’ll use the DataFramesMeta and DataFrames packages directly. The strategy for solving tonight’s puzzle is to identify any orders that include bagels, and then–among those orders–filter for customers whose names have the initials “JP”.

Julia Solution

# Filter for orders of bagels
@chain products begin
  @rsubset occursin("bagel", lowercase(:desc))
  innerjoin(orders_items, on = :sku)
  innerjoin(orders,       on = :orderid)
  # Filter for orders in 2017
  @rtransform :ordered = DATE.DateTime(:ordered, "yyyy-mm-dd HH:MM:SS")
  @rsubset(DATE.year(:ordered) == 2017)
  # Find customers whose initials are J.P.
  innerjoin(customers,    on = :customerid)
  @rsubset occursin(r"^J[A-z]+ P[A-z]+", :name)
  @select :customerid :name :phone
end
1×3 DataFrame
 Row │ customerid  name             phone
     │ Int64       String31         String15
─────┼───────────────────────────────────────────
   1 │       1475  Joshua Peterson  332-274-4185

R Solution

For comparison, here’s my R solution.

library(lubridate)

# Filter for orders containing bagels
products |>
  filter(str_detect(tolower(desc), "bagel")) |>
  inner_join(orders_items, by = "sku") |>
  # Only look at orders in 2017
  inner_join(orders,       by = "orderid") |>
  filter(year(ordered) == 2017) |>
  # Look for a customer with the initials J.P.
  inner_join(customers,    by = "customerid") |>
  filter(str_detect(name, "^J[A-z]+ P[A-z]+")) |>
  select(customerid, name, phone)
# A tibble: 1 × 3
  customerid name            phone       
       <dbl> <chr>           <chr>       
1       1475 Joshua Peterson 332-274-4185

Third Night of Hanukkah

Here’s the setup for Night 3.

Based on tonight’s clue, we want to find someone whose birthsign is a Cancer, and who was born in the Year of the Rabbit.

Julia Solution

First, we’ll record the date ranges of all the rabbit years over the last century.

rabbit_years = hcat(
  DATE.Date.(["1927-02-02", "1939-02-19", "1951-02-06", "1963-01-25", "1975-02-11", 
              "1987-01-29", "1999-02-16", "2011-02-03", "2023-01-22"], "yyyy-mm-dd"),
  DATE.Date.(["1928-01-22", "1940-02-07", "1952-01-26", "1964-02-12", "1976-01-30", 
              "1988-02-16", "2000-02-04", "2012-01-22", "2024-02-09"], "yyyy-mm-dd")
)
9×2 Matrix{Dates.Date}:
 1927-02-02  1928-01-22
 1939-02-19  1940-02-07
 1951-02-06  1952-01-26
 1963-01-25  1964-02-12
 1975-02-11  1976-01-30
 1987-01-29  1988-02-16
 1999-02-16  2000-02-04
 2011-02-03  2012-01-22
 2023-01-22  2024-02-09

Then we’ll add a variable :born_in_rabbit_year to identify customers whose date of birth falls in a rabbit year.

# For each rabbit year, figure out who was born in that year
customers = @rtransform customers :born_in_rabbit_year = false;

for i in 1:size(rabbit_years, 1) # Loop over rows
  global customers = @chain customers begin
    @rtransform :born_in_rabbit_year = :born_in_rabbit_year | (rabbit_years[i,1] <= :birthdate <= rabbit_years[i,2])
  end
end

Now we can filter the customers for those born in a rabbit year, who live in the same neighborhood as the customer from Night Two, and whose astrological sign is Cancer.

@chain semijoin(
  (@rsubset customers :born_in_rabbit_year),
  (@rsubset customers :customerid == 1475), 
  on = :citystatezip
) begin
  @rsubset (DATE.month(:birthdate) in [6,7])
  @select :customerid :name :birthdate :phone
end
2×4 DataFrame
 Row │ customerid  name             birthdate   phone
     │ Int64       String31         Date        String15
─────┼───────────────────────────────────────────────────────
   1 │       2550  Robert Morton    1999-07-08  917-288-9635
   2 │       4304  Stephanie Flynn  1963-06-12  838-477-9033

R Solution

For comparison, here’s my R solution. The key difference is that we were able to use the Tidyverse functions interval() and %within% to simplify the process of filtering the data for customers born in a rabbit year.

library(lubridate)

# Enumerate which years are a Year of the Rabbit
rabbit_years <- interval(
  start = ymd(c("1927-02-02", "1939-02-19", "1951-02-06",
                "1963-01-25", "1975-02-11", "1987-01-29",
                "1999-02-16", "2011-02-03", "2023-01-22")),
  end   = ymd(c("1928-01-22", "1940-02-07", "1952-01-26",
                "1964-02-12", "1976-01-30", "1988-02-16",
                "2000-02-04", "2012-01-22", "2024-02-09"))
)

# Filter for customers with a birthdate in the range of interest
  suspects <- customers |>
    # Identify customers born in the Year of the Rabbit
    filter(map_lgl(birthdate, ~ any(.x %within% rabbit_years))) |>
    # Identify customers whose birthsign is Cancer
    filter((month(birthdate) == 6 & day(birthdate) >= 22) |
           (month(birthdate) == 7 & day(birthdate) <  22))

# Narrow down to customers who share the same neighborhood as the contractor
  semi_join(
    x = suspects,
    y = customers |> filter(customerid == 1475),
    by = "citystatezip"
  ) |> select(customerid, name, phone, birthdate, citystatezip)
# A tibble: 1 × 5
  customerid name          phone        birthdate  citystatezip     
       <dbl> <chr>         <chr>        <date>     <chr>            
1       2550 Robert Morton 917-288-9635 1999-07-08 Jamaica, NY 11435

Fourth Night of Hanukkah

Here’s the setup for Night 4.

For tonight, we want to find a female customer who is often the first person to place a bakery order at Noah’s each day.

Julia Solution

I’ll start by ensuring that the datetime variables in the data are encoded as DateTime variables rather than strings.

# Ensure that datetime variables are encoded as such
orders = @chain orders begin
  @rtransform  :ordered = DATE.DateTime(:ordered, "yyyy-mm-dd HH:MM:SS")
  @rtransform  :shipped = DATE.DateTime(:shipped, "yyyy-mm-dd HH:MM:SS")
end
213232×6 DataFrame
    Row │ orderid  customerid  ordered              shipped              items ⋯
        │ Int64    Int64       DateTime             DateTime             Missi ⋯
────────┼───────────────────────────────────────────────────────────────────────
      1 │    1001        6878  2017-01-31T02:56:45  2017-01-31T09:00:00  missi ⋯
      2 │    1002        6375  2017-01-31T04:13:35  2017-01-31T12:15:00  missi
      3 │    1003        8045  2017-01-31T04:45:12  2017-01-31T10:45:00  missi
      4 │    1004        5385  2017-01-31T05:49:19  2017-01-31T09:00:00  missi
      5 │    1005        9235  2017-01-31T06:28:11  2017-01-31T16:00:00  missi ⋯
      6 │    1006        8414  2017-01-31T07:30:08  2017-01-31T19:15:00  missi
      7 │    1007        4285  2017-01-31T08:01:48  2017-01-31T19:15:00  missi
      8 │    1008        5725  2017-01-31T08:03:11  2017-01-31T08:03:11  missi
   ⋮    │    ⋮         ⋮                ⋮                    ⋮              ⋮  ⋱
 213226 │  214226        2598  2022-12-16T22:53:21  2022-12-18T15:30:00  missi ⋯
 213227 │  214227        8352  2022-12-16T22:58:00  2022-12-23T11:00:00  missi
 213228 │  214228        2886  2022-12-16T23:11:08  2022-12-20T18:15:00  missi
 213229 │  214229        8735  2022-12-16T23:16:16  2022-12-21T14:15:00  missi
 213230 │  214230        4043  2022-12-16T23:23:33  2022-12-17T15:00:00  missi ⋯
 213231 │  214231        3894  2022-12-16T23:48:43  2022-12-23T18:15:00  missi
 213232 │  214232        1368  2022-12-16T23:49:15  2022-12-17T18:30:00  missi
                                               2 columns and 213217 rows omitted

Then I’ll look at the customers who tended to be the bakery’s first customer in the morning.

# Identify bakery orders
@chain products begin
  @rsubset startswith(:sku, "BKY")
  semijoin(orders_items, _, on = :sku)
  semijoin(orders,       _, on = :orderid)
  # Subset to customers ordering before 5AM
  @rsubset DATE.hour(:ordered) < 5
  # Find the first customer of each day
  @rtransform :date_of_order = DATE.Date(:ordered)
  @orderby :ordered;
  groupby(:date_of_order)
  combine(first)
  # Count how often each customer placed a day's first bakery order
  groupby(:customerid)
  combine(nrow => :n_first_orders)
  # Pull customer information
  leftjoin(_, customers, on = :customerid)
  @select :customerid :n_first_orders :phone :name
  # View top five customers
  @orderby -:n_first_orders
  first(1)
end
1×4 DataFrame
 Row │ customerid  n_first_orders  phone         name
     │ Int64       Int64           String15?     String31?
─────┼────────────────────────────────────────────────────────
   1 │       2749               4  607-231-3605  Renee Harmon

R Solution

We’ll use the same approach for our R solution.

# First identify orders from the bakery
  orders |>
    semi_join(orders_items |> filter(startsWith(sku, "BKY"))) |>
    # Subset to orders before 5AM
    filter(hour(ordered) < 5) |>
    # Identify customers who place the first bakery order in a day
    group_by(date(ordered)) |>
    slice(1) |>
    # Count how often the customer placed the first bakery order of the day
    group_by(customerid) |>
    summarize(n_first_orders = n()) |>
    ungroup() |>
    # Find the customer with the most first-orders
    slice_max(order_by = n_first_orders) |>
    # Get customer information
    inner_join(customers, by = "customerid") |>
    select(customerid, n_first_orders, phone, name)
# A tibble: 1 × 4
  customerid n_first_orders phone        name        
       <dbl>          <int> <chr>        <chr>       
1       2749              4 607-231-3605 Renee Harmon

Fifth Night of Hanukkah

Here’s the setup for Night 5.

Let’s find this cat lady. We’ll look at orders of products for senior cats, then we’ll see which customer has the largest average quantity of such products in their orders.

Julia Solution

import TidierStrings as TDS
using StatsBase: mean
@chain products begin 
  @rsubset TDS.str_detect(:desc, "Senior Cat")
  semijoin(orders_items, _, on = :sku)
  innerjoin(orders,         on = :orderid)
  groupby([:customerid, :orderid])
  @combine :total_qty = sum(:qty)
  groupby(:customerid)
  @combine :avg_order_qty = mean(:total_qty)
  innerjoin(customers,   _, on = :customerid)
  @select :customerid :name :phone :avg_order_qty
  @orderby -:avg_order_qty
  first(1)
end
1×4 DataFrame
 Row │ customerid  name           phone         avg_order_qty
     │ Int64       String31       String15      Float64
─────┼────────────────────────────────────────────────────────
   1 │       3068  Nicole Wilson  631-507-6048        8.04762

R solution

We’ll use the exact same approach for the R solution.

# Filter for order items related to senior cats
orders_items |>
  left_join(y = products, by = "sku") |>
  filter(str_detect(desc, "Senior Cat")) |>
  inner_join(orders, by = "orderid") |>
  # For orders, calculate the total quantity of senior cat products
  group_by(customerid, orderid) |>
  summarize(total_qty = sum(qty)) |>
  # For the customers, calculate average quantity of senior cat products
  group_by(customerid) |>
  summarize(avg_order_qty = mean(total_qty)) |>
  # Find the custoemr with the largest average quantity
  slice_max(n = 1, order_by = avg_order_qty) |>
  inner_join(customers, by = "customerid") |>
  select(customerid, name, phone, avg_order_qty)
# A tibble: 1 × 4
  customerid name          phone        avg_order_qty
       <dbl> <chr>         <chr>                <dbl>
1       3068 Nicole Wilson 631-507-6048          8.05

And there we have it.

Sixth Night of Hanukkah

Here’s the setup for Night 6.

The strategy for tonight is to look for orders where every item is discounted, and then see which customer has the most such orders.

Julia Solution

This night was the most annoying to do in Julia. I wanted to filter for groups where every element in the group satisfies some condition. Specifically, I wanted to filter orders to those where every item in the order is discounted.

For this kind of filtering operation, StackOverflow search suggested that we need to use some kind of ugly syntax with the filter() function, which you can see in the solution below.

@chain orders begin
  innerjoin(orders_items, on = :orderid)
  innerjoin(products    , on = :sku)
  # Filter for orders where every item is discounted
  groupby(:orderid)
  @rtransform :is_discounted = :unit_price < :wholesale_cost
  filter(:is_discounted => d -> length(d) == sum(d), _)
  # Identify customer with the most discounted orders_items
  groupby(:customerid)
  combine(nrow => :n_disc_orders)
  @orderby -:n_disc_orders
  first(1)
  # Get their information
  innerjoin(customers   , on = :customerid)
  @select :customerid :name :phone :n_disc_orders
end
1×4 DataFrame
 Row │ customerid  name         phone         n_disc_orders
     │ Int64       String31     String15      Int64
─────┼──────────────────────────────────────────────────────
   1 │       4167  Sherri Long  585-838-9161             54

The filter() solution is dissatisfying for two reasons. First, the order of the arguments is annoying when working in a chain: it has the structure filter(my_filter_operation, things_to_filter), so in a pipe you always have to do filter(my_filter_operation, _). Second, as far as I know, we can’t just write something clean like all(:is_discounted) to filter for groups where every record meets the condition. Instead, we have to use this weird key => value where the key is a variable of interest (:is_discounted) and the value is an anonymous function d -> .... It works but it’s hard to read. Like a lot of Julia operations in DataFrames, DataFramesMeta, etc. there’s a theoretical elegance to it if you want to think through an intellectual exercise. But if you’re just a normal analyst trying to work with data, it’s utterly bizarre and makes the code hard to communicate to other people.

R Solution

In contrast, the R code for filtering orders that are entirely discounted is much more readable thanks to the use of the all() function.

# Identify orders where every item is discounted
fully_discounted_orders <- orders_items |>
  left_join(y = products, by = "sku") |>
  left_join(y = orders,   by = "orderid") |>
  group_by(orderid) |>
  filter(all(unit_price < wholesale_cost)) |>
  ungroup()
  
# Find the customer with the most fully-discounted orders
fully_discounted_orders |>
  count(customerid, name = "n_disc_orders") |>
  slice_max(n = 1, order_by = n_disc_orders) |>
  left_join(y = customers, by ="customerid") |>
  select(n_disc_orders, customerid, name, phone, citystatezip)
# A tibble: 1 × 5
  n_disc_orders customerid name        phone        citystatezip   
          <int>      <dbl> <chr>       <chr>        <chr>          
1            35       4167 Sherri Long 585-838-9161 Bronx, NY 10473

Seventh Night of Hanukkah

Here’s the setup for Night 7.

Fr this night, we’ll look at orders of items with colors, placed in the same hour as the person from Night 6 (with customer ID 4167), where the ordered item is the same product as that person’s but in a different color.

Julia Solution

  function get_color(product_desc)
    color_regex_match = match(r"\([A-z]+\)", product_desc)
    if isnothing(color_regex_match)
      ""
    else
      color_regex_match.match
    end
  end
get_color (generic function with 1 method)

orders_with_color = @chain orders begin
  innerjoin(orders_items, on = :orderid)
  innerjoin(products    , on = :sku)
  @rtransform :color = get_color(:desc) :item = replace(:desc, r" \([A-z]+\)" => "" )
  @rsubset :color != ""
  @select :customerid :ordered :item :color
end
28012×4 DataFrame
   Row │ customerid  ordered              item                  color
       │ Int64       DateTime             String                Abstract…
───────┼──────────────────────────────────────────────────────────────────
     1 │       4716  2017-01-31T10:23:50  Noah's Poster         (yellow)
     2 │       3808  2017-01-31T10:29:22  Noah's Lunchbox       (mauve)
     3 │       2645  2017-01-31T10:53:59  Noah's Gift Box       (magenta)
     4 │       2520  2017-01-31T13:45:58  Noah's Jersey         (mauve)
     5 │       8385  2017-01-31T13:47:53  Noah's Bobblehead     (puce)
     6 │       3779  2017-01-31T15:32:25  Noah's Poster         (yellow)
     7 │       8267  2017-01-31T16:11:59  Noah's Jewelry        (puce)
     8 │       4721  2017-01-31T17:50:31  Noah's Jersey         (white)
   ⋮   │     ⋮                ⋮                    ⋮                ⋮
 28006 │       9105  2022-12-16T17:45:34  Noah's Jewelry        (puce)
 28007 │       2824  2022-12-16T17:46:36  Noah's Jewelry        (yellow)
 28008 │       6222  2022-12-16T21:57:11  Noah's Action Figure  (orange)
 28009 │       5950  2022-12-16T22:00:02  Noah's Poster         (mauve)
 28010 │       8352  2022-12-16T22:58:00  Noah's Action Figure  (yellow)
 28011 │       3894  2022-12-16T23:48:43  Noah's Lunchbox       (green)
 28012 │       1368  2022-12-16T23:49:15  Noah's Gift Box       (magenta)
                                                        27997 rows omitted

@chain orders_with_color begin
  @rtransform :date_ordered = DATE.Date(:ordered) :hour_ordered = DATE.hour(:ordered)
  semijoin(_, filter(:customerid => isequal(4167), _), on = [:item, :date_ordered, :hour_ordered])
  antijoin(_, filter(:customerid => isequal(4167), _), on = [:item, :color, :date_ordered])
end
1×6 DataFrame
 Row │ customerid  ordered              item           color      date_ordered ⋯
     │ Int64       DateTime             String         Abstract…  Dates.Date   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │       5783  2018-12-31T12:26:39  Noah's Poster  (orange)   2018-12-31   ⋯
                                                                1 column omitted

R Solution

# Identify orders of items that come in different colors
orders_with_color <- orders |>
  full_join(orders_items) |>
  full_join(products |> select(sku, desc)) |>
  mutate(color = str_extract(desc, "\\([A-z]+\\)"),
         item  = str_remove(desc, fixed(color)) |> str_trim()) |>
  filter(!is.na(color))

# Identify orders of the same item but different colors,
# on the same day but preceding Sherri's order
orders_with_color |>
  group_by(date(ordered)) |>
  filter(any(customerid == 4167)) |>
  filter(ordered <= (ordered[customerid == 4167])) |>
  filter(item   %in% unique(item[customerid == 4167]),
         !color %in% unique(item[customerid == 4167])) |>
  filter(n_distinct(customerid) > 1) |>
  mutate(lead_time = ordered[customerid == 4167] - ordered) |>
  ungroup() |>
  filter(customerid != 4167) |>
  slice_min(order_by = lead_time) |>
  left_join(customers |> select(customerid, name, phone)) |>
  select(customerid, name, phone)
# A tibble: 1 × 3
  customerid name         phone       
       <dbl> <chr>        <chr>       
1       5783 Carlos Myers 838-335-7157

Of the three people who bought a different-colored poster before Sherri did, one of them bought it just before Sherri did: Carlos.

Eighth Night of Hanukkah

Here’s the setup for Night 8.

For this night, we want to find someone who owns all of the different collectible items. The hiccup here is that collectibles (identified by the product code with COL in the name) come in different colors, which we don’t care about. So we’ll remove the color from the product name, and then identify someone who owns each collectible item.

Julia Solution

collectibles = @chain products begin
  @rsubset startswith(:sku, "COL")
  @rtransform :collectible = replace(:desc, r" \([A-z]+\)" => "" )
end
85×5 DataFrame
 Row │ sku      desc                           wholesale_cost  dims_cm         ⋯
     │ String7  String                         Float64         String15        ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ COL0037  Noah's Jewelry (green)                  28.32  17.4|11.2|5.7   ⋯
   2 │ COL0041  Noah's Ark Model (HO Scale)           2487.35  7.2|4.3|0.4
   3 │ COL0065  Noah's Jewelry (mauve)                  33.52  19.0|12.2|10.5
   4 │ COL0166  Noah's Action Figure (blue)             13.98  12.1|7.7|7.2
   5 │ COL0167  Noah's Bobblehead (blue)                 5.36  8.9|5.6|0.6     ⋯
   6 │ COL0483  Noah's Action Figure (mauve)            13.95  10.3|2.5|0.2
   7 │ COL0837  Noah's Poster (mauve)                    3.86  13.0|3.3|0.8
   8 │ COL0891  Noah's Bobblehead (orange)               5.65  19.9|14.8|0.3
  ⋮  │    ⋮                   ⋮                      ⋮               ⋮         ⋱
  79 │ COL9192  Noah's Poster (amber)                    4.76  19.6|16.2|11.5  ⋯
  80 │ COL9241  Noah's Gift Box (orange)                 2.43  17.4|13.1|2.2
  81 │ COL9349  Noah's Action Figure (orange)           15.47  16.6|12.9|11.9
  82 │ COL9420  Noah's Jewelry (amber)                  30.01  13.8|9.1|4.9
  83 │ COL9773  Noah's Poster (magenta)                  4.13  7.2|4.9|1.9     ⋯
  84 │ COL9819  Noah's Lunchbox (blue)                   9.87  14.2|4.1|3.3
  85 │ COL9948  Noah's Jewelry (magenta)                32.97  18.0|17.5|6.2
                                                    1 column and 70 rows omitted

n_distinct_collectibles = @chain collectibles begin
  @distinct :collectible
  nrow
end
8

@chain orders_items begin
  innerjoin(collectibles, _, on = :sku)
  innerjoin(orders      , _, on = :orderid)
  @distinct :customerid :collectible
  groupby(:customerid)
  combine(nrow => :n_collectibles)
  @rsubset(:n_collectibles == n_distinct_collectibles)
  innerjoin(customers   , _, on = :customerid)
  @select :customerid :name :phone :n_collectibles
end
1×4 DataFrame
 Row │ customerid  name         phone         n_collectibles
     │ Int64       String31     String15      Int64
─────┼───────────────────────────────────────────────────────
   1 │       3808  James Smith  212-547-3518               8

R Solution

# Identify all of the item identifiers of collectibles
# and create a "collectible" variable that deduplicates
# the same collectible available in different colors
collectibles <- products |> 
  filter(startsWith(sku, "COL")) |>
  mutate(collectible = str_remove(desc, " \\(.+")) |>
  distinct(sku, collectible)

# Count the number of distinct collectibles (ignoring color)
n_distinct_collectibles <- collectibles |>
  distinct(collectible) |>
  nrow()

# Find a customer with every distinct collectible
# and get his name and phone number
orders_items |>
  inner_join(collectibles) |>
  inner_join(orders) |>
  distinct(customerid, collectible) |>
  group_by(customerid) |>
  summarize(n_collectibles = n()) |>
  filter(n_collectibles == n_distinct_collectibles) |>
  left_join(customers) |>
  select(customerid, name, phone, n_collectibles)
# A tibble: 1 × 4
  customerid name        phone        n_collectibles
       <dbl> <chr>       <chr>                 <int>
1       3808 James Smith 212-547-3518              8