Word count code feedback

Think of the feedback below as PR comments on your code. I have reviewed word count implementation written by many developers and condensed my common PR comments below. When you implement wc in Go, these are some things you should look into. These will help you evaluate whether your solution is good enough.

Argument Parsing and configuration

  • How are you parsing the CLI arguments? Are you parsing them manually by doing string splitting and using a giant switch case statement? If yes, then please use eitherflag.Args() from Golang standard library or Cobra. Go is suitable for writing CLI applications and supports typical tasks like parsing flags and arguments passed to the command. Make use of these packages and refactor your code accordingly. When you use Cobra, it will force a certain structure to your code. Read more about Cobra documentation and samples to find out. By the way, Docker, Kubernetes, and many other tools use Cobra for parsing arguments. When in doubt, for real-world projects, use Cobra.
  • How do you represent flags passed to the wc command inside your code? Have you considered centralizing the three flags (-l, -w and -c) in a struct called WcFlags? If you centralize all the flags in a single struct, you can easily pass that struct to the internal method instead of passing each of the flags (booleans) as separate parameters. Refactor your code to use a centralized config struct to represent all configuration options of the wc command.
  • Ideally, you should have a function that parses all arguments and converts them into an internal WcFlags struct that you can pass to other methods. Otherwise, you’ll end up with wordCount method with multiple arguments for the wc command options.

Code formatting and tooling

  • Ensure you have configured your code editor (we recommend using VS Code or Goland) so your code is auto-formatted using gofmt.
  • You should be able to run unit tests easily using your code editor, a single button click, or a keyboard shortcut.
  • You should be able to debug the code and test cases using breakpoints directly in your IDE.
  • Don’t add extra newlines when not needed. Newlines should be used in the code to improve readability and to separate functions and blocks of code.
  • Follow Go’s standard variable naming convention - capital letters for exported variables, lowercase letters for unexported variables and functions, camelCase for variables and methods, etc. Read https://go.dev/doc/effective_go for more detailed info.

Unit testing and error handling

  • Have you written test cases for the code? Can you identify edge cases and failure points where wc can fail? Some examples are: you run wc in a directory where you don’t have read permissions. How does your program behave in that case? Does it fail with a proper error message or panic abruptly? Can you write a unit test for this scenario? What other error scenarios can you think of?
  • Do you know about table-driven tests in Go? Can you identify the test scenarios for grep and refactor the code to use table-driven tests? Did you know that you can use VS Code feature to generate test cases by right-clicking the method name? It will auto generate table driven tests for that function.
  • Are you testing the case where your program reads from STDIN? If not, how can you test your program, in this case, automated manner (without manually passing the input)? Find out how to write a test for reading input via a standard input stream. (hint: you can use io.Reader interface to make it easier to "mock" standard input stream in Go). Search about this on StackOverflow and come up with some solution.
  • Write tests in standard got and want fashion. Understand the difference between t.Errorf and t.Fatalf and use t.Errorf unless you really need to fail and stop the execution of the entire test suite.
  • Don’t panic unless the program absolutely cannot recover from failures. Use appropriate error handling for all error conditions (business-related error cases).
  • Use handle or declare rule for error handling. Either your method returns an error or handles it (by printing/logging it or converting the error into an appropriate business value).
  • Are you creating the data needed for the test cases (directory and file structures along with some dummy file contents), or do you have a pre-defined folder structure for testing? How will you ensure your test cases run on any other person’s machine (or even on CI)?
  • Have you tried testing your program in a directory tree containing many thousand files? How about testing the code on large files (GBs in size)? Does it work as expected, or does it fail silently? Does it go out of memory (if you try to run it with some memory limits)? How are you load-testing your implementation of wc?

Power of interfaces

  • Do you have two separate functions - one for reading from STDIN and the other for reading from a file? Is it possible to abstract the reading part into a common functionality? If you still didn’t get the hint, try using io.Reader interface to create this common abstraction for both STDIN and os.File. When you do refactor your code to use io.Reader, notice how your test cases become simpler.

Performance (when running wc on many large files)

  • Are you reading the entire file in memory and then counting lines, words, and chars in that file? Your program will work fine for small files, but will it be able to word count in a 10GB file where your laptop RAM is 8GB? Can you refactor your code to count the lines, words, and chars from a file in a “line by line” fashion? Basically, don’t read the entire file in memory as a single string (or even []string). Instead, read the file line by line, and for each line, increment relevant counters (line, word, and char counts).
  • Can you think of performance optimization when you run a word count into a large directory (containing thousands of files)? Can you get the results, or does the program fail with an error like “too many file descriptors”? How can you solve this? (hint: limit the number of max file descriptors by using a buffered channel). If you’re using a goroutine to read from a file, you’ll need to limit the concurrency of goroutines by having a maxOpenFileDescriptors variable.
  • Load test your implementation by generating very large files (> 5Gb) and also a very large number of files (thousands) in a single directory. You can generate these files using something like this on most Linux systems.
    • base64 -i /dev/urandom | tr '+' '\n' | tr '/' ' ' | head -c 10000000
      This will create a text file with many lines and words with 10Mb in size. Use this to create large files by modifying the 10Mb limit.

Handling Results

  • Do you think creating a struct for storing word count results makes sense? This way, you can pass it around easily if needed. Also, you can store any errors in the result as well along with the name of the file.
  • Ideally, there should be only one final fmt.Printf statement, everything else should return a value. This way, you can calculate and return the results for printing (as a side effect).
  • How do you calculate the last total row if you’re doing word count on multiple files?

Others (Misc)

  • When running word count, can you exclude non-text files (e.g., binary files or image or movie files)? Can you implement some flag (e.g. --exclude-files=jpg,mov,png) to exclude all these file types in the directory? Alternatively, you can implement another flag (e.g. --include-files=txt,md,go) to only include relevant file types when calculating the word count.
  • For CLI programs such as wc, we should NOT implement interactivity. i.e., Don’t make the CLI behave in an interactive way, where you first ask the user the file path via a prompt, then ask your (via a prompt) whether the user wants line, word, or char counts. DON’T DO THIS. EVER. A good CLI should be non-interactive and follow Unix Philosophy (read more here - http://www.catb.org/~esr/writings/taoup/html/ch01s06.html). Don’t break the following important rule of Unix Philosophy.
    • Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter the output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
  • Ideally, you should not have multiple fmt.Printf calls throughout your code. Most functions should be returning data (or pushing the data onto a channel) instead of performing the side effect of printing the output directly. Always be lazy and write pure functions without side effects as much as possible.
  • Arrange your code and functions such that they follow the step-down rule. This way, reading the code becomes much easier.