$$y = f(x)$$

Good morning. I'm Olivia Freeman, a litigation partner here at Stackpole and Jimenez. I will interview you to assess your potential as a trainer for data scientists who will be delivering testimony before juries in major cases. We will be discussing an example based on pre-Y2K data for which the original COBOL code has been lost.

Take a look at this sample and tell me what you see.

620347063090855923807917997721858608661631068995235645378177104014767789652743964159803229483770841925589832892292286986403761691423637401807201523714950545444657339621140437121496260729613875570484352659376212853376846957603814132646208222068325728455853874908148274501059593957510317085652325675520351934346260504951871467428789316903908272956086961807799797199312024121156303540074648923105179957771270719085644273935237604648129436277640192606630193228886090859680895145544942484962281060740562657262996913071516966586267146409590992990833463227092303587095413133793946056899174208741717523342410635798360794031438840849620226166116608669132462285321719195558629785984518646371576322248617748237723088631053508253781638859668096278037543791868245934539321131584695723882017923455752950 [continues for several megabytes]

A: That's a very long sequence of digits. By "digits," I mean numerals as they would appear in text, because this chunk would be far too large to use directly as a number for calculation.

Q: How do you know that it's all digits?

A: I don't. That's all I see, but there could be letters and other characters later.

Q: How would you confirm that?

A: I would write a one-line computer script. Assuming the file has been loaded as a single string x into an interactive session using the R programming language, for example, I could use grep("[^0-9]",x). If I got back a 1, it would mean that there were non-digits; an empty result, which R prints as integer(0), would indicate all digits.

Q: Translate that code into plain English, please.

A: Sure. It says use the grep function to look for anything that is not a digit 0 through 9 in the variable x.
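
A small sketch of that check on two made-up strings (x_ok and x_bad are illustrative stand-ins, not the case data). One detail worth knowing before explaining it to a jury: when nothing matches, grep returns an empty integer vector rather than a 0, so grepl, which answers TRUE or FALSE, may be easier to read aloud.

```r
# Hypothetical stand-ins for the real file contents
x_ok  <- "62034706309085592380"
x_bad <- "6203470630908559238O"  # ends in the letter O, not a zero

# grep returns the positions of the elements that contain a match
hits_ok  <- grep("[^0-9]", x_ok)   # nothing matches: an empty integer vector
hits_bad <- grep("[^0-9]", x_bad)  # the first (and only) element matches

# grepl asks the same question but answers TRUE or FALSE
has_non_digit_ok  <- grepl("[^0-9]", x_ok)
has_non_digit_bad <- grepl("[^0-9]", x_bad)
```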

Q: What does grep stand for?

A: It's "global regular expression print," similar to a "global search" in a word processing document. A regular expression is a set of rules for describing patterns in text. What does the text represent?

Q: That's a good question. The plaintiff claims that it is evidence of financial fraud. The defendant denies that and says it is simply data for a parts management system that was used by a predecessor company.

A: What else can you tell me about the data?

Q: The data is in blocks of 20 characters. Each block may or may not contain one or more digits that encode a parts bin location somewhere in multiple facilities.

A: How does that work?

Q: It's the digits before the last "0", if any, back to, but not including, the next-to-last "0", if any.

A: OK, let's see. ... So, for the first block, it would be

"62034706309085592380" --> 8559238

Q: How would you check the data for completeness?

A: Without more, I can only check to see if the number of characters in x is evenly divisible by 20.

Q: How would you do that?

A: That's another simple one-liner: nchar(x) %% 20.

Q: Explain that.

A: It's saying count the total number of characters and see if there is anything left over after dividing by 20. In the same way, nothing is left over from dividing eight by two, but dividing seven by two leaves a remainder of one.

Q: What does the double percent sign mean?

A: That's the modulo operator, the fancy term for finding the remainder. If the result is 0, the length of x divides evenly by 20. Anything else means there are leftover digits after making the blocks, which could indicate data missing from the last chunk or digits missing here and there throughout.
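
The remainder check can be tried on toy values first. This is a minimal sketch; x_demo and x_short are made-up strings standing in for the real data.

```r
# The two schoolroom examples
rem_even <- 8 %% 2   # 0: nothing left over
rem_odd  <- 7 %% 2   # 1: one left over

# A hypothetical 40-character string: exactly two blocks of 20
x_demo <- strrep("1234567890", 4)
complete <- nchar(x_demo) %% 20 == 0   # TRUE

# Drop the last three characters to mimic a truncated file
x_short <- substr(x_demo, 1, 37)
leftover <- nchar(x_short) %% 20       # 17 characters in a partial block
```

A nonzero remainder only says that something is missing. It cannot say where, which is why the check is a necessary test of completeness rather than a sufficient one.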

Q: Back to your example for the first block. How did you do that?

A: I just took a pencil, struck out the last "0" at the end, then looked before that for the previous "0" in the middle and struck out everything from it back to the beginning.
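
The pencil method can be mirrored step by step in R. This is only a sketch for this one block, and it assumes the block has at least two zeros with something between the last two; the variable names are illustrative.

```r
block <- "62034706309085592380"

# Break the block into its 20 single characters
chars <- strsplit(block, "")[[1]]

# Positions of every "0" in the block
zero_pos <- which(chars == "0")

# The last zero and the one before it
last_zero <- zero_pos[length(zero_pos)]
prev_zero <- zero_pos[length(zero_pos) - 1]

# Keep only what sits strictly between those two zeros
code <- paste(chars[(prev_zero + 1):(last_zero - 1)], collapse = "")
code   # "8559238"
```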

Q: That simple? The whole file could be done by hand?

A: Haha! With enough patience!

Q: Take a few minutes to look over this script.

```r
# Split the string into 20 character blocks
blocks <- strsplit(input_string, "(?<=.{20})", perl = TRUE)[[1]]

# Function to find the index of the last "0" in each block
find_last_zero <- function(block) {
  # Reverse the block to find the first "0" from the right
  reversed_block <- rev(unlist(strsplit(block, "")))
  # Find the index of the first "0"
  index <- which(reversed_block == "0")[1]
  # Return the numeric characters preceding it back to the next "0"
  return(paste(reversed_block[1:(index-1)], collapse = ""))
}

# Apply the function to each block
results <- sapply(blocks, find_last_zero)

# Combine the block sequence number with the results
final_results <- paste(seq_along(blocks), results)
```

Q: Comments?

A: That's one way to approach it, I guess. It needs fixing. Still, it's probably preferable to work from that rather than from what's called a for loop, where you take each block in turn, read the characters and save them until finding a "0", then start in again until finding another "0", and so on. I mean that in the sense that the for loop would be harder to explain to a lay audience.

Q: What needs fixing?

A: Let's use the same block example, which has a "0" at the end. That will make the result of applying the find_last_zero function come out as "0", instead of "8559238" which is what we get by hand. Where did you get this?
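
The reason can be seen by stepping through the function body on that block. The sketch reuses the script's own variable names. The seq_len line is shown only as the usual guard against the 1:0 surprise, not as a complete fix, since the function would still be looking on the wrong side of the zero.

```r
block <- "62034706309085592380"

# Same steps as inside find_last_zero
reversed_block <- rev(unlist(strsplit(block, "")))
index <- which(reversed_block == "0")[1]   # 1, because the block ends in "0"

# In R, 1:0 counts DOWN: it is c(1, 0), not an empty range
idx <- 1:(index - 1)

# Index 0 is silently dropped, so only element 1 (the "0" itself) is kept
result <- paste(reversed_block[idx], collapse = "")
result   # "0"

# seq_len(0) is truly empty, which avoids the count-down surprise
safe_idx <- seq_len(index - 1)
```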

Q: A bot. Can you fix it?

A: Yeah, but I wouldn't.

Q: Why not?

A: Well, even corrected, this code is too terse and hard to read.

Q: How would you do it?

A: I like to think of data problems in terms of school algebra: y = f(x) where y is what we want, x is what we have and f is a function to make the transformation. f can be simple, such as find the square root of x where x = 64. There's a built-in function to take a square root, sqrt. For our data there's obviously no single built-in function, so we have to compose one. Keeping attention on what, rather than how, makes a program easier to follow.

Q: Where would you start?

A: By designing the structure of y, to hold the result. It's pretty simple, a single column with as many rows as there are blocks. The contents will be either a string of digits or NA indicating missingness. Initially, the contents will be all blank.
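
One way to sketch that container, assuming that "blank" can be represented by R's missing value for text, NA_character_. The block count n_blocks is a made-up number and the column name code is illustrative.

```r
# Hypothetical number of 20-character blocks
n_blocks <- 5

# One text column, one row per block, every row starting out as missing
y <- data.frame(code = rep(NA_character_, n_blocks),
                stringsAsFactors = FALSE)
```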

Q: Why wouldn't you use actual numbers?

A: Three reasons. One, we begin with strings, not numbers. Two, 20-digit numbers are unwieldy. Three, we aren't going to be using it to calculate anything. The results are just codes representing a location. They're tokens and could just as well be baby names.

Q: What's next?

A: Making a copy of the data to work from and keeping the original data in read-only form. As discussed earlier, I'd check to make sure all the characters were digits and that it divides into 20-character blocks. If it passes those tests then it's our x, a very long run of digits. The next "what" to make is x chopped up into 20-character blocks.
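
A minimal sketch of those steps on a made-up two-block string. The names x_original, starts, and blocks are illustrative, and substring is just one of several ways to do the chopping.

```r
# Hypothetical original data: two 20-character blocks run together
x_original <- paste0("62034706309085592380", "11111111111111111111")

# Work on a copy; the original is never modified
x <- x_original

# The two checks discussed earlier: digits only, and a whole number of blocks
stopifnot(!grepl("[^0-9]", x), nchar(x) %% 20 == 0)

# Starting position of each block: 1, 21, 41, ...
starts <- seq(1, nchar(x), by = 20)

# Cut out characters start..start+19 for every start at once
blocks <- substring(x, starts, starts + 19)
```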

Q: What do you do with the blocks?

A: Classify them. Blocks without zeros. Blocks with all zeros. Blocks with a single zero in the first position. Blocks with multiple zeros all at the end. For these we just need a block sequence number, and we'll let those rows in y be NA. Blocks with a single zero elsewhere will be simple: we just delete from the zero to the end. For blocks with multiple zeros, we need to delete everything up to and including the next-to-last zero, and everything from the last zero on. The remaining portions of these blocks get slotted into y at the row positions matching their block positions in x. I can write it out and send it to you if you like. It will be easy to make up a file to play the role of x.
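
One possible shape for f, offered as a sketch and not as the script promised in the interview. The function name extract_code and the test blocks are hypothetical. It also assumes that any block whose last two zeros are adjacent has nothing to report and should be NA.

```r
# f for a single block: return the code as text, or NA if there is none
extract_code <- function(block) {
  chars <- strsplit(block, "")[[1]]
  zero_pos <- which(chars == "0")
  n_zero <- length(zero_pos)

  # No zeros at all: nothing to extract
  if (n_zero == 0) return(NA_character_)

  last_zero <- zero_pos[n_zero]
  # With a single zero, treat the start of the block as the boundary
  prev_zero <- if (n_zero >= 2) zero_pos[n_zero - 1] else 0L

  # Nothing between the boundaries (zero in first position, adjacent zeros)
  if (last_zero - prev_zero <= 1) return(NA_character_)

  paste(chars[(prev_zero + 1):(last_zero - 1)], collapse = "")
}

# Made-up blocks, one per class described above
blocks <- c(
  "62034706309085592380",  # multiple zeros
  "11111111111111111111",  # no zeros
  "00000000000000000000",  # all zeros
  "01111111111111111111",  # single zero, first position
  "12345678901111111111",  # single zero elsewhere
  "11111111111111111100"   # multiple zeros, all at the end
)

# y = f(x): one result per block, kept as text
y <- data.frame(
  block = seq_along(blocks),
  code  = vapply(blocks, extract_code, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Each class becomes one short branch, so the function can be read aloud in the same order as the spoken description.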

Q: Thank you. I'll look forward to seeing your script.R

The R Data Scientist has an interview with a litigator

Richard Careaga