The Bicameral Mind

By NASA/STS-34 – https://danielmarin.naukas.com/files/2015/09/28_Galileo_deployment.jpg, Public Domain, https://commons.wikimedia.org/w/index.php?curid=6049516

I remember an English lesson I received in grammar school that serves me well in our current disinformation age. I learned to read news from several sources to decide what was actually true. Through experience I learned which sources were verifiable and trustworthy, but even trustworthy sources could occasionally get it wrong. Reading several sources, and looking for primary sources of information, allows me at least a slight chance of determining the truth.

The Internet and social media have not made it easier to determine the truth. I’m disgusted with the number of people who ask their social media acquaintances for answers that they should be getting from reliable sources, or that they should already know. I really don’t think you should ask the Internet whether you should take a child to the doctor because they swallowed a bolt or have a rash.

The Internet lies, and even worse, it changes. We all heard about Edward Welch driving to a D.C. area pizza parlor with an assault rifle to stop a sex trafficking ring supposedly being run from the parlor’s non-existent basement. From my personal experience, I distinctly recall one of the US Air Force’s first female combat pilots firing an anti-satellite missile at the Solwind satellite. Wikipedia reports that same event with a man firing the missile. I’m too lazy to dig up the microfiche of the company newsletter proudly touting that a woman combat pilot fired our missile, so I have no evidence with which to edit the Wikipedia article.

One of the other things I learned from that time was the design protocol to never trust automation to “fire ordnance”. If, for timing purposes, we required computers to calculate the firing time, we used an array of electromagnetic relays to take the output of multiple computers and arrive at a consensus. Even then, a human initiated the chain of events leading up to the stage separation. On the Inertial Upper Stage, deployed from the Space Shuttle, there is a lever on the side of the booster that turns the booster on. It’s impractical to have an astronaut suit up just to flip the lever, so there was the equivalent of a tin can over the lever tied to a string (a “lanyard” in NASA-speak) with the other end tied to the Space Shuttle. The astronaut mission specialist would flip the switches to tilt the booster’s cradle in the Space Shuttle and release the springs to push the booster out. As the booster drifted out, the string would tighten and flip the lever, turning on the booster. Only then would the booster’s computers boot up, calculate the booster’s position, and wait until the booster had drifted far enough away from the Space Shuttle to fire its attitude control jets, turn the booster around, and fire the first stage.

The US military also had us apply this principle to weaponry. Automation could not initiate anything that could kill a human. A human needs to be holding the trigger on anything that could kill another human.

The same principles of not trusting automation applied to finances. The first NASDAQ trading islands were required to delay the quote information on stocks by several seconds before it reached the bidding systems. This was to discourage feedback loops in automated trading systems. Since then those limits have been eliminated, or have simply become ineffective. At least once, the stock price of a perfectly good company was driven to zero, causing a suspension of trading. After a day, the stock returned to its “normal” expected price. The SEC has already commented that high-frequency trading, based on nothing but fluctuations of stock prices, is driving out the research-driven investors who look at the fundamentals of a company, such as profitability, indebtedness, and gross sales.

When it comes to AI in the modern world, these experiences suggest some fundamental rules:

  1. A human must initiate any action that can potentially harm another human.
    • Corollary: Industrial processes that may release toxic materials, even for safety reasons, must be initiated by a human.
    • Corollary: Automation cannot directly trade on the financial markets. Markets are for humans. High speed trading is harmful to the economy.
  2. When automation has indirect control of potentially hazardous processes (such as firing a booster after a human has enabled it), multiple redundant processes must reach a consensus to order an action (a small sketch of such a voter follows).
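
Here is roughly what that consensus rule looks like in software. This is my own illustration, not flight code; the function name, the three-channel layout, and the tolerance are all assumptions made for the example. The voter accepts a firing time only when at least two independently computed channels agree, and otherwise refuses and defers to the human in the loop.

// A 2-of-3 voter over redundantly computed firing times (illustrative sketch).
#include <array>
#include <cmath>
#include <cstddef>
#include <optional>

// Returns the agreed firing time, or nothing if no two channels agree.
std::optional<double> voteFiringTime(const std::array<double, 3>& channels,
                                     double tolerance = 1e-3)
{
    for (std::size_t i = 0; i < channels.size(); ++i) {
        for (std::size_t j = i + 1; j < channels.size(); ++j) {
            if (std::fabs(channels[i] - channels[j]) <= tolerance) {
                return (channels[i] + channels[j]) / 2.0;  // consensus of the agreeing pair
            }
        }
    }
    return std::nullopt;  // disagreement: refuse to act, fall back to the human
}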

Given the malleability of the web, I was really surprised that OpenAI released ChatGPT into the wild without close supervision. ChatGPT has no means of checking the information fed to it, but it learns from everything, so we should not be surprised that it now helps feed the flood of misinformation.

A large language model like ChatGPT is nothing but a bunch of nodes doing matrix multiplies. Modern neural networks add an occasional convolution algorithm. There is nothing in the model to drive inference or intuition. It responds purely from the data it was trained on.

A medical DNA testing company recognized that LLMs (large language models) are statistical in nature, but they still needed AI to scale up the review of DNA results. Human review of DNA results just wasn’t able to keep up. They wisely created a system to monitor results and periodically check them manually, retraining the model as needed.

Even now, though, really big LLMs like Bard and ChatGPT are too big for effective monitoring. We don’t have a way to untrain a model once it has bogus data in it. One way to help an LLM defend itself is to create another LLM that is trained only from proctored data. That second LLM helps train the first LLM to recognize bogus sources. The proctored LLM will also help the owners determine when they need to throw away the big LLM if it strays.

Now the rubber hits the road. A corporation has spent millions training the LLM, so it will be reluctant to just throw it away. Even though the large AI occasionally lies and hallucinates, it is useful most of the time. Legal regulation is doomed to failure, so we must resort to competition, where companies with LLMs advertise how well they monitor their AI. An industry group or even a government agency could rate the AIs for consumer protection — just like the U.S. government publishes the on-time performance of airlines and their safety records.


Coding Snippet

Probably as part of a job interview, I came up with this modification of heap sort to remove duplicates from an array. I haven’t seen the technique used before, but I have to think lots of undergraduate computer science majors have come up with the same algorithm, so this exact representation is copyrighted, but feel free to modify it and use it under an MIT-style license (shame on you if you attempt to put a more restrictive license on it):

// -*-mode: c++-*-
////
// @copyright 2024 Glen S. Dayton.
// Rights and license enumerated in the included MIT license.
//
// @author Glen S. Dayton
//
#include <cstdlib>
#include <algorithm>
#include <iterator>

// Delete duplicate entries from an array using C++ heap.
template<typename  random_iterator>
auto deleteDuplicates(random_iterator start, random_iterator stop) -> random_iterator
{
    auto topSortedArray = stop;
    auto theArrayLength = std::distance(start, stop);
    auto heapEnd = stop;

    if (theArrayLength > 1)
    {
        // Transform the array  into a heap ( O(n) operation (linear) )
        // The range [start, stop) determines the array.
        std::make_heap(start, stop);

        // Perform a heap sort
        // Pop the first element, which is the max element.
        // Shrinks the heap leaving room for the max element at the end.
        // pop_heap is a O(log N)  operation.
        auto prevElement = *start;
        std::pop_heap(start, heapEnd);
        --heapEnd;

        // push the max element onto the sorted area
        // End of the array is the sorted area.
        --topSortedArray;
        *topSortedArray = prevElement;

        // Work our way up. Pop the max element of the top of the heap and write it to the top of the sorted area.
        while (heapEnd != start)
        {
            auto currentElement = *start;
            std::pop_heap(start, heapEnd);
            --heapEnd;

            if (currentElement != prevElement)
            {
                --topSortedArray;
                *topSortedArray = currentElement;
                prevElement = currentElement;
            }
        }
    }

    return topSortedArray;
}

You may find this code and its unit tests at https://github.com/gsdayton98/deleteduplicates.git
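Here is a minimal usage sketch (my own, not part of the repository), assuming the deleteDuplicates template above is in scope. The function returns an iterator to the first element of the deduplicated, ascending-sorted range packed at the end of the array:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> data{5, 3, 5, 9, 1, 3, 9, 9, 7};

    auto uniqueStart = deleteDuplicates(data.begin(), data.end());

    for (auto it = uniqueStart; it != data.end(); ++it) {
        std::cout << *it << ' ';   // prints: 1 3 5 7 9
    }
    std::cout << std::endl;
    return 0;
}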

Barbarians at the gates

(Creative Commons Licensed)

I’ll never forget the day my manager responded to my suggestion for more unit testing with, “So you want to spend time writing non-revenue code.” Only then did I begin to realize the corporate world was changing. My ideas about focusing on quality, creating happy repeat customers, and valuing and helping employees succeed and be happy were consigned to the ash-heaps of history.

In my early professional development two books were required reading: In Search of Excellence by Tom Peters and Robert H. Waterman Jr., and Built to Last by Collins and Porras. Over the next decade, emphasis on the concepts contained in those books faded. A lot of corporate training disappeared, with companies saying “training is the employee’s responsibility”. The problem is, when a company works on cutting-edge technology, no other place exists to get training in that specialty.

What I didn’t know then was that the seeds of destruction had already been sown with a small change in law the previous decade. In 1982 the Reagan administration eliminated the rules limiting company stock buybacks. In the past, a company buying back its own stock to pump up the stock price was against the law as “stock manipulation”. Stockholders love buybacks because of that jacked-up stock price. About the same time, companies started compensating executives with ever more excessive amounts of stock, so executives love stock buybacks too. Meanwhile, to pay for the buyback, the company shorts the money needed for research, training, and maintenance. The company sacrifices the factors it needs for its long-term survival for the short-term gain of a few. Companies have forgotten that in addition to stockholders, they also serve their customers, employees, and communities. Forgetting any one of a company’s constituencies has real-world, evil results. Consider the deaths caused by the 737 MAX, or by Tesla’s self-driving mode. Focusing on just the stockholders forces the company to focus on short-term results rather than safety and long-term growth.

So what does this have to do with programming? People reading this blog are smarter than average. Smarter than average engineers are more likely to found their own companies or work in start-ups. When you have a chance to influence your company, please think about how you want the company to run a decade from now. Just because it’s legal doesn’t make it right for your company to follow the trend. Build your company to last, and you’ll be wealthy but not obscenely so, and you’ll make your community and country stronger — and you’ll still have a company that hasn’t robbed someone blind, or killed someone.

In the past I advocated leaving a company if the company won’t focus on quality. That position met a lot of resistance from people needing to feed their families and keep a roof over their heads. Happily, despite many short-term setbacks, there are still plenty of tech jobs out there. The tech economy always grows in the long term — so I still advocate leaving a bad company rather than ruining your career and reputation by staying. Here in the Silicon Valley, the idea has gained traction. The Great Resignation continues because companies not looking at the long term over-emphasize short-term productivity and under-value creativity and innovation. Employees at the biggest companies in the Silicon Valley are unionizing (even though I fear they will find the most powerful union organizations are just another variety of corporation with corporate morals). Perhaps some companies still exist that benefit everybody and not just the executives.


In the meantime, let’s talk some programming. Make your life more tolerable with a good IDE (Integrated Development Environment). Over the years I’ve used several free IDEs and several expensive ones.

For free, you can still use emacs in electric-c mode. You can do the usual IDE things such as build and debug in emacs, but you’ll suffer through all the customization you need to do to make it usable.

Even better, you may download Microsoft’s Visual Studio Code (aka VSCode) from https://code.visualstudio.com/. I’ve taken a few classes to keep my C++ skills up-to-date, and they all used VSCode. VSCode is great for toy programs, but you’ll find yourself editing JSON configuration files for anything major. It does not integrate well with major build systems.

Other free systems, such as CodeBlocks, https://www.codeblocks.org/, just haven’t been close enough to prime time for my use.

Eclipse CDT remains popular, but I found it painfully slow. You may pay money for a plug-in from SlickEdit to make it more modern, with built-in and faster syntax checking and code completion. SlickEdit also offers a fully integrated IDE (https://www.slickedit.com/). I paid the bucks for the “Pro” version for years, but a year ago I gave up on SlickEdit because it didn’t keep up with the evolving C++ language, nor did it integrate with build systems like CMake, Make, and Ninja. On top of that, it depended upon an obsolete version of Python (fixed since then).

I now pay my money to JetBrains (https://www.jetbrains.com/clion/features/) for their CLion product. I love its extensibility. Out of the box it supports CMake, Make, and Ninja builds. It also supports custom build commands. I’m sorry this sounds like an advertisement, but it is what I use. On Linux systems the latest version sometimes crashes, but I haven’t lost anything. It integrates nicely with Git and GitHub, and allows integration with other cloud services. Oh, it also supports Rust and Python.


Coding Ugly

A while ago I found this chunk of code in an old application:



struct Ugly {
    int wordA;
    int wordB;
    int wordC;
    int wordD;
};

...
    int *word = &words.wordA;
    for (int k=0; k < 4; ++k) sum += *word++;


At the time it looked wrong to me, and it still does, but because it was in "production" code, I left it alone because any change would trigger another cycle of QA testing. If the project had unit test coverage, then I could have just changed it and maintained the functionality of the application. Among the factors that made the implementation wrong: the meanings, and hence the representations, of the components wordA, wordB, wordC, and wordD were likely to change, and additional components might be appended. The engineer who wrote this actually said that if those events happened, they would just re-write this code.

To fix this beast I should have written a unit test, and then turned the structure into an object. I could have blindly made an object that implements an array, with specialized accessors for the components wordA, B, C, and D. Rather, I should have spent an extra hour doing a little analysis to determine the intent of the structure (Object Oriented Analysis), and then proceeded to Design and Programming. Khalil Stemmler has an excellent summary of the object-oriented programming cycle at https://khalilstemmler.com/articles/object-oriented/programming/4-principles/ .

Here is a more object-oriented look at the problem, giving tags (wordA, wordB, wordC, wordD) to the elements of the array, and protecting access to the underlying data:

#include <algorithm>
#include <array>
#include <initializer_list>
#include <numeric>


class Ugly {
private:
    std::array<int, 4> words;

public:
    Ugly(std::initializer_list<int> l) : words{} { std::copy_n(l.begin(), std::min(l.size(), words.size()), words.begin()); }

    auto wordA() const -> auto { return words[0]; }
    auto wordB() const -> auto { return words[1]; }
    auto wordC() const -> auto { return words[2]; }
    auto wordD() const -> auto { return words[3]; }

    auto sum() const -> auto { return std::accumulate(words.begin(), words.end(), 0); }
};

It’s still ugly, because C++ doesn’t yet allow a std::array to be initialized directly from an initializer_list. A complaint I have about C++ is that new features seem to require a lot of bookkeeping code. Other languages, such as Rust, allow definitions of syntactic sugar to avoid bookkeeping code.

Memory Mapped Files

Penguins in the desert

Remember the old personal digital assistants, otherwise known as PDAs? One in particular, the Palm Pilot, had an interesting operating system. Once you created or opened an application or file, it was open forever. It was just a matter of navigating through the screens to find it again, and it was instantly usable again. Its 32-bit Motorola chip allowed it to address the entire device’s contents. All files resided in memory (at least that is what I surmised). This resulted in zippy performance and never worrying about saving a change.

A Palm Pilot

If only we could do that with a full-size operating system. Now that we have 64-bit addressing, we can address a huge chunk of the planet’s data — but just a chunk of it. According to Cisco, as cited in Wikipedia, the planet entered the Zettabyte Era in 2012. We would need at least 70 addressing bits to address the entire planet’s data. Nevertheless, 64 bits allows the addressing of every byte of 16 million terabytes of disk.


Of course, the modern CPUs in new machines can’t really directly address every byte of 16 million terabytes. They’re still limited by the number of physical address lines on their processor chips, so my little machine has only 64 GB of physical memory in it, not counting the extra memory for graphics.

Nevertheless, an immense number of problems can be solved entirely in memory that were previously solved using combinations of files and memory (and magnetic tapes and drums). Essentially, though, you still have the problem of reading data from the outside world into memory.

While processing large data files for signal processing, I discovered (or re-discovered) that memory mapping a file was much faster than reading it. On the old VAX/VMS system I used back then, the memory-mapped method was an order of magnitude faster. On more modern systems, such as Windows, Linux, and MacOS, memory mapping sometimes works many times faster:

50847534 primes read in
Read file in 0.503402 seconds.

50847534 primes scanned
Scanned memory map in 0.00546256 seconds

Memory read time is 92.155 times faster

The timings include the time to open the file, set up the mapping, scan the contents of the file, and close the mapping and the file.

To get this magical speed-up on POSIX-like systems (OSX, Linux, AIX, …), start with the man page for mmap. On POSIX you basically open an existing file (to get a file descriptor), get the length of the file, map it, and get back a pointer to the mapped contents.

On Windows, it’s slightly more complicated. Start with Microsoft’s documentation at https://docs.microsoft.com/en-us/windows/win32/memory/file-mapping. Open an existing file (to get a HANDLE), get the file length, create a file mapping, and do an additional step to get a FileView on it. You may change the FileView to get at different sections of the file. Evidently that is more efficient than just creating another mapping.
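For what it’s worth, here is a rough sketch of that Windows sequence. The file name is a placeholder and error handling is abbreviated; treat it as an outline of the calls rather than production code.

#include <windows.h>
#include <iostream>

int main() {
    HANDLE file = ::CreateFileA("primes.dat", GENERIC_READ, FILE_SHARE_READ,
                                nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    if (!::GetFileSizeEx(file, &size)) return 1;

    HANDLE mapping = ::CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping == nullptr) return 1;

    // Map the whole file; a nonzero offset or length here would give a different view.
    const void* view = ::MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view == nullptr) return 1;

    std::cout << "Mapped " << size.QuadPart << " bytes at " << view << std::endl;

    ::UnmapViewOfFile(view);
    ::CloseHandle(mapping);
    ::CloseHandle(file);
    return 0;
}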

On POSIX-like systems with mmap you may create multiple mappings on the same file. POSIX mmap appears to be really cheap so you may close a mapping and make another one in short order to get a new view into the file.

Of course you can hide all the operating-system-specific details if you use the Boost shared mapping to map a file: https://www.boost.org/doc/libs/1_79_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file . With Boost you create a file mapping object given a file name, then get a pointer to the memory, and the size available, from a mapped region created from the mapping object.
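A minimal sketch of that Boost.Interprocess sequence, again with a placeholder file name:

#include <cstddef>
#include <iostream>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

int main() {
    namespace bip = boost::interprocess;

    // Create a file mapping object from the file name...
    bip::file_mapping mapping("primes.dat", bip::read_only);

    // ...then a mapped region supplies the pointer and the size.
    bip::mapped_region region(mapping, bip::read_only);
    const auto* bytes = static_cast<const unsigned char*>(region.get_address());
    std::size_t size = region.get_size();

    std::cout << "Mapped " << size << " bytes; first byte = "
              << static_cast<unsigned>(bytes[0]) << std::endl;
    return 0;
}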

Generally, if you’re not using Boost, you’re wasting your time. Many of the features of C++11, 17, and 20 were first tried out in Boost. A lot of thought and review goes into the Boost libraries. As with all good rules of thumb and examples of groupthink, there are exceptions. Boost’s attempt to isolate operating-system-dependent functions behind an operating-system-independent interface is one place where the abstraction just causes trouble — different operating systems have different implementation philosophies. On Windows the file-mapping API just maps a section of a file into a section of memory, while on Linux, MacOS (or OSX), and other UNIX-like systems mmap has many functions — mmap is the Swiss army knife of memory management. The Boost interface provides only the file mapping, attempts to emulate the file view of Windows on Linux, and exposes none of the other functions of mmap.

For example, give mmap a file descriptor of -1 and the flag MAP_ANON (or MAP_ANONYMOUS), and it will just give you a new chunk of memory. For really fast memory management, place a Boost Pool on the newly allocated memory with a C++ placement new operator.
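Here is a small sketch of that anonymous-mapping trick (my own example; the Telemetry type is made up for illustration). mmap hands back a private block with no file behind it, and placement new constructs an object in it; a Boost pool could be layered on the same block in the same way.

#include <cstddef>
#include <new>
#include <sys/mman.h>

struct Telemetry {
    double values[1024];
};

int main() {
    const std::size_t len = sizeof(Telemetry);

    // fd = -1 plus MAP_ANONYMOUS: no file, just pages.
    void* block = ::mmap(nullptr, len, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (block == MAP_FAILED) return 1;

    auto* telemetry = new (block) Telemetry{};   // placement new: no heap allocation
    telemetry->values[0] = 42.0;

    telemetry->~Telemetry();                     // trivial here, but a good habit
    ::munmap(block, len);
    return 0;
}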

For another example of why low-level access is handy, a file may be mapped into multiple processes. You may use this shared area for interprocess semaphores, condition variables, or just shared memory. If you use the MAP_PRIVATE flag, modifications are private to the process making them: a change causes a writable copy of the page to be created to hold the modification, and the other process doesn’t see the change. MAP_SHARED, though, allows all changes to be shared between the processes.
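A short sketch of the difference, using an anonymous shared mapping and a fork (my own example). With MAP_SHARED the parent sees the child’s write; with MAP_PRIVATE it would still print 0.

#include <iostream>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    void* mem = ::mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    auto* counter = static_cast<int*>(mem);
    *counter = 0;

    if (::fork() == 0) {          // child writes through the shared mapping
        *counter = 42;
        _exit(0);
    }
    ::wait(nullptr);              // parent waits, then reads the same page

    std::cout << "counter = " << *counter << std::endl;   // prints 42
    ::munmap(mem, sizeof(int));
    return 0;
}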

Without further ado, here is the code that produced the benchmark above:

// The Main Program
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include "stopwatch.hpp"

auto fileReadMethod(const char*) -> unsigned int;
auto memoryMapMethod(const char*) -> unsigned int;


auto main(int argc, char* argv[]) -> int {
  int returnCode = EXIT_FAILURE;
  const char* input_file_name = argc < 2 ? "primes.dat" : argv[1];

  try {
    StopWatch stopwatch;
    std::cout << fileReadMethod(input_file_name) << " primes read in" << std::endl;
    auto fileReadTime = stopwatch.read();
    std::cout << "Read file in " << fileReadTime << " seconds." << std::endl;

    std::cout << std::endl;
    stopwatch.reset();
    std::cout << memoryMapMethod(input_file_name) << " primes scanned" << std::endl;
    auto memoryReadTime = stopwatch.read();
    std::cout << "Scanned memory map in " << memoryReadTime << " seconds" << std::endl;

    std::cout << std::endl;
    std::cout << "Memory read time is " << fileReadTime/memoryReadTime << " times faster"  << std::endl;

    returnCode = EXIT_SUCCESS;
  } catch (const std::exception& ex) {
    std::cerr << argv[0] << ": Exception: " << ex.what() << std::endl;
  }
  return returnCode;
}

// File reading method of scanning all the bytes in a file
#include <fstream>
auto fileReadMethod(const char* inputFileName) -> unsigned int {
    unsigned int census = 0;

    unsigned long prime;
    std::ifstream primesInput(inputFileName, std::ios::binary);
    while (primesInput.read(reinterpret_cast<char*>(&prime), sizeof(prime))) {
        ++census;
    }

    return census;
}

#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include "systemexception.hpp"

// Count the number of primes in the file by memory mapping the file
auto memoryMapMethod(const char* inputFilename) -> unsigned int  {

    int fd = ::open(inputFilename, O_RDONLY | O_CLOEXEC, 0);
    if (fd < 0) throw SystemException{};

    struct stat stats;  //NOLINT
    if (::fstat(fd, &stats) < 0) throw SystemException{};

    size_t len = stats.st_size;
    void* mappedArea = ::mmap(nullptr, len, PROT_READ, MAP_FILE | MAP_PRIVATE, fd, 0L);
    if (mappedArea == MAP_FAILED) throw SystemException{};
    auto* primes = static_cast<unsigned long*>(mappedArea);
    unsigned int countOfPrimes = len/sizeof(unsigned long);

    unsigned int census = 0;
    for (auto* p = primes; p != primes + countOfPrimes; ++p) {
        if (*p != 0) ++census;   // dereference each entry so the mapped pages are actually read
    }

    ::munmap(mappedArea, len);
    ::close(fd);

    if (countOfPrimes != census) throw std::runtime_error{"Number of mapped primes mismatch"};
    return countOfPrimes;
}

// Utility for stop watch timing
#include "stopwatch.hpp"

using std::chrono::steady_clock;
using std::chrono::duration;
using std::chrono::duration_cast;

auto StopWatch::read() const -> double {
  steady_clock::time_point stopwatch_stop = steady_clock::now();
  steady_clock::duration time_span = stopwatch_stop - start;
  return duration_cast< duration<double> >(time_span).count();
}

And the stopwatch header ….

#ifndef STOPWATCH_HPP
#define STOPWATCH_HPP
#include <chrono>

class StopWatch {
 public:
  StopWatch() : start {std::chrono::steady_clock::now()} { }

  void reset() { start = std::chrono::steady_clock::now(); }

  [[nodiscard]] auto read() const -> double;

 private:
  std::chrono::steady_clock::time_point start;

};

#endif  // STOPWATCH_HPP

Compile the code with C++20. Use it as you will. I’d appreciate some credit, but don’t insist on it. For the more legal types, apply this license:

@copyright 2022 Glen S. Dayton.  Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.

Do not change the terms of this license, nor make it more restrictive.

Parenthetical Note

n    Power of 2        ISO/IEC 80000-13 prefix    Approximate power of 10    SI prefix
1    2^10 = 1024^1     kibi                       10^3  = 1000^1             kilo
2    2^20 = 1024^2     mebi                       10^6  = 1000^2             mega
3    2^30 = 1024^3     gibi                       10^9  = 1000^3             giga
4    2^40 = 1024^4     tebi                       10^12 = 1000^4             tera
5    2^50 = 1024^5     pebi                       10^15 = 1000^5             peta
6    2^60 = 1024^6     exbi                       10^18 = 1000^6             exa
7    2^70 = 1024^7     zebi                       10^21 = 1000^7             zetta
8    2^80 = 1024^8     yobi                       10^24 = 1000^8             yotta

Prefixes

Memory has historically been measured in powers of 1024 (2^10), but disk space in powers of 1000 (10^3) — meaning a kilobyte of disk space is 1000 bytes while a kilobyte of memory is 1024 bytes. In 2008 ISO and IEC published new prefixes for the binary powers — kibi through yobi. I have yet to see the ISO/IEC prefixes in any advertisements for memory or storage. Human language, especially English, is wonderful in its overlaying of meaning depending on context.

Insecurity

Caribou automatically identified as “wolf coyote”

As you have noticed, I don’t post very often, so I am gratified that so many people have subscribed. I do make an effort to keep the usernames secure, encrypted, and I will never sell them. My limitation is I depend upon my provider to keep their servers secure. So far they have proven themselves competent and secure. I use multi-factor authentication to administer the site.

Too bad the rest of the world doesn’t even take these minimal measures. Just recently my personal ISP scanned for my personal email addresses “on the dark web”. To my pleasant surprise, they did a thorough job, but to my horrific shock, they found my old email addresses and cleartext passwords. I was really surprised that my ISP provided me with links to the password lists on the dark web. I was able to download them; they were files of thousands of emails and cleartext passwords from compromised web sites. I destroyed my copies of the files so no one could accuse me of hacking those accounts. I was lucky that my compromised accounts were ones I no longer used, so I could just safely delete the accounts. In short order, my ISP had delivered three shocks to me:

  1. My ISP delivered lists of usernames and passwords of other people to me.
  2. The passwords were stored in cleartext.
  3. Supposedly reputable websites did not have sufficient security to prevent someone from downloading the password files from the websites’ admin areas.

I guess that last item shouldn’t be a surprise because in #2 the websites actually stored the unencrypted password. Perhaps this wouldn’t bother me so much if the principles for secure coding were complicated or hard to implement.

If you think security is complicated, you’re not to blame. The book on the 19 Deadly Sins of Software Security grew into the 24 Deadly Sins of Software Security in a later edition. An entire industry exists to scare you into hiring consulting services and buying their books. Secure software, though, isn’t that complicated; it just has a lot of details.

Let’s start with your application accepting passwords. The first rule, which everyone seems to get, is don’t echo the password when the user enters it. From the command line use getpass() or readpassphrase(). Most GUI frameworks offer widgets for entering passwords that don’t echo the user’s input.

Next, don’t allow the user to overrun your input buffers — more on that later. Finally, never store the password in an unencrypted form. This is the part where the various websites that exposed my usernames and passwords utterly failed. You never need to store the password — instead hash the password and store the hash. When you enter a password, the server, or the client (which transmits the hash over an encrypted channel such as TLS), hashes the password, and the server compares it with the saved hash for your account. This is why your admin can never tell you your own password: they can’t reverse the hash.

This is an example of the devil being in the details; the security isn’t complicated, just detailed. The concept of password hashing is decades old. The user enters their password, the system hashes it immediately, and compares the hash with what it has stored. If someone steals the system’s password file, they would need to generate passwords that happen to hash to the same values in the password file.

Simple in concept, but the details will get you. Early Unix systems used simple XOR-style hashing, so it was easy to create passwords that hashed to the same values, or even to reproduce the original password. Modern systems use a cryptographic hash such as SHA2-512. Even with a cryptographic hash, though, two different users who happen to use the same password will end up with the same hash. Modern systems therefore add a salt value to your password. That salt value is usually a unique number stored with your username, so on most systems an attacker needs to steal both the password file and the file of salt values. Of course, you’ll also have the wisdom to set the permissions on the password and salt files so only the application owner can see and read them, in case someone does break into your system.
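To make the idea concrete, here is a small sketch of salt-then-hash using OpenSSL’s one-shot SHA512(). The password and salt here are invented for the example, and a production system should really use a dedicated password-hashing function (bcrypt, scrypt, or Argon2) rather than a single SHA-512 pass; the point is only that the cleartext password is never stored.

#include <iomanip>
#include <iostream>
#include <random>
#include <sstream>
#include <string>
#include <openssl/sha.h>

// Hash salt+password; only this digest (and the salt) gets stored, never the password.
std::string hashPassword(const std::string& password, const std::string& salt) {
    std::string salted = salt + password;
    unsigned char digest[SHA512_DIGEST_LENGTH];
    SHA512(reinterpret_cast<const unsigned char*>(salted.data()), salted.size(), digest);

    std::ostringstream hex;
    for (unsigned char byte : digest)
        hex << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(byte);
    return hex.str();
}

int main() {
    // A per-user random salt, stored alongside the username.
    std::random_device rd;
    std::string salt = std::to_string(rd()) + std::to_string(rd());

    std::string stored = hashPassword("correct horse battery staple", salt);
    std::cout << "stored hash: " << stored << std::endl;

    // Login check: hash the candidate with the stored salt and compare digests.
    bool accepted = (hashPassword("correct horse battery staple", salt) == stored);
    std::cout << "login " << (accepted ? "accepted" : "rejected") << std::endl;
    return 0;
}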

In short,

  1. Don’t echo sensitive information
  2. Never store the unencrypted password; store only a salted hash
  3. Protect the hashed passwords.

We’re straying into systems administration and devops, so let’s get back to coding.

All of the deadly sins have fundamental common roots:

Do not execute data.

When you read something from the outside world, whether from a file, stream, or socket, don’t execute it. When you accept input from the outside world, think before you use it. Don’t allow buffer overruns. Do not embed input directly into a command without first escaping it or binding it to a named parameter. We all know the joke:

As a matter of fact my child is named
“; DELETE * FROM ACCOUNTS”
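The cure is to bind the untrusted input as a parameter instead of splicing it into the SQL text. Here is a small sketch using SQLite’s C API; the table and column names are invented for the example, and the quotes and semicolons in the input stay plain data.

#include <iostream>
#include <string>
#include <sqlite3.h>

bool insertStudent(sqlite3* db, const std::string& untrustedName) {
    sqlite3_stmt* stmt = nullptr;

    // The "?" is a placeholder; the SQL text itself never changes.
    if (sqlite3_prepare_v2(db, "INSERT INTO students(name) VALUES (?)", -1,
                           &stmt, nullptr) != SQLITE_OK)
        return false;

    // Bind the raw input as a value, not as part of the command.
    sqlite3_bind_text(stmt, 1, untrustedName.c_str(), -1, SQLITE_TRANSIENT);

    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) return 1;
    sqlite3_exec(db, "CREATE TABLE students(name TEXT)", nullptr, nullptr, nullptr);

    insertStudent(db, "\"; DELETE * FROM ACCOUNTS\"");   // stored literally, nothing executed
    std::cout << "inserted without executing the input" << std::endl;

    sqlite3_close(db);
    return 0;
}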

A good way to avoid executing data is to follow a second rule:

Do not trespass.

“Do not trespass” means don’t refer to memory you may not own. Don’t overrun your array boundaries, don’t de-reference freed memory pointers, and pay attention to the number of arguments you pass into functions and methods. A common way of breaking into a system is overrunning an input buffer located in local memory until it overruns the stack frame. The data pushed into the buffer would be executable code. When the overrun overlaps the return pointer of the function, it substitutes an address in the overrun to get the CPU to transfer control to the payload in the buffer. A lot of runtime code is open source, so it just takes inspection to find the areas of code where this type of vulnerability can be exploited. Modern CPUs and operating systems often place executable code in read-only areas to protect against accidental (or malicious) overwrites, and may even mark data areas as no-execute — but you can’t depend upon those features existing. Scan the database of known vulnerabilities at https://cve.mitre.org/cve/ to see if your system needs to be patched. Write your own code so it is not subject to this vulnerability.

Buffer overruns are perhaps the most famous of the data trespasses.

With C++ it is easy to avoid data trespasses. C++ functions and methods are strongly typed so if you attempt to pass the wrong number of arguments, it won’t even compile. This avoids a common C error of passing an inadequate number of arguments to a function so the function accesses random memory for missing arguments.

Despite its strong typing, C++ requires care to avoid container boundary violations. std::vector::operator[] does not produce an exception when used to access beyond the end of a vector, nor does it extend the vector when you write beyond the end. std::vector::at() does produce exceptions on out-of-range accesses. Adding to the end of the vector with std::vector::push_back() may proceed until memory is exhausted or an implementation-defined limit is reached. I’m going to reserve memory management for another day. In the meantime, here is some example code demonstrating the behavior of std::vector:

// -*- mode: c++ -*-
////
// @copyright 2022 Glen S. Dayton. Permission granted to copy this code as long as this notice is included.

// Demonstrate accessing beyond the end of a vector

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <vector>

using namespace std;


int main(int /*argc*/, char* argv[]) {
  int returnCode = EXIT_FAILURE;

  try {
    vector< int> sample( 10,  42 );

    std::copy( sample.begin(),  sample.end(),  std::ostream_iterator< int>(cout,  ","));
    cout << endl;

    cout << "Length " << sample.size() << endl;
    cout << sample[12] << endl;
    cout << sample.at( 12 ) << endl;
    cout << "Length " << sample.size() << endl;

    cout << sample.at( 12 ) << endl;

    returnCode = EXIT_SUCCESS;
  } catch (const exception& ex) {
    cerr << argv[0] << ": Exception: " << typeid(ex).name() << " " << ex.what() << endl;
  }
  return returnCode;
}

And its output:

42,42,42,42,42,42,42,42,42,42,
Length 10
0
/Users/gdayton19/Projects/containerexample/Debug/containerexample: Exception: St12out_of_range vector

C++ does not make it easy to limit the amount of input your program accepts into a string. The stream extraction operator, >>, does pay attention to a field width set with the stream’s width() method or the setw manipulator — but it stops accepting at whitespace. You must use a getline() of some sort to get a string with spaces, or use C++14’s quoted-string facility. Here’s an example of the extraction operator >>:

// -*- mode: c++ -*-
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <limits>
#include <stdexcept>
#include <string>

using namespace std;


int main(int /*argc*/, char* argv[]) {
  int returnCode = EXIT_FAILURE;
  constexpr auto MAXINPUTLIMIT = 40U;
  try {
    string someData;
    cout << "String insertion operator input? " << flush;
    cin >> setw(MAXINPUTLIMIT) >> someData;
    cout << endl << "  This is what was read in: " << endl;
    cout << quoted(someData) << endl;

    // Discard the rest of line
    cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    cout <<  "Try it again with quotes: " << flush;
    cin >> setw(MAXINPUTLIMIT) >> quoted(someData);  
    cout << endl;

    cout << "  Quoted string read in: " << endl;
    cout << quoted(someData) << endl;
    cout << "Unquoted: " << someData <<  endl;

    cout << "Length of string read in: " << someData.size() << endl;

   returnCode = EXIT_SUCCESS;
  } catch (const exception& ex) {
    cerr << argv[0] << ": Exception: " << ex.what() << endl;
  }
  return returnCode;
}

And a some sample output from it:

String insertion operator input? The quick brown fox jumped over the lazy dog.

  This is what was read in: 
"The"
Try it again with quotes: "The quick brown fox jumped over thge lazy dog."

  Quoted string read in: 
"The quick brown fox jumped over thge lazy dog."
Unquoted: The quick brown fox jumped over thge lazy dog.
Length of string read in: 46

The quoted() manipulator ignores the field width limit on input.

You need to use getline() to read complete unquoted strings with spaces. The getline() used with std::string, though, ignores the field width. Here is some example code using getline():

// -*- mode: c++ -*-
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <string>

using namespace std;

int main(int /*argc*/, char* argv[]) {
  int returnCode = EXIT_FAILURE;
  constexpr auto MAXINPUTLIMIT = 10U;
  try {
    string someData;
    cout << "String getline input? " << flush;
    cin.width(MAXINPUTLIMIT);   // This version of getline() ignores width.
    getline(cin, someData);
    cout << endl << "   This is what was read in: " << endl;
    cout << quoted(someData) << endl;
  
   returnCode = EXIT_SUCCESS;
  } catch (const exception& ex) {
    cerr << argv[0] << ": Exception: " << ex.what() << endl;
  }
  return returnCode;
}

And a sample run of the above code:

String getline input? The rain in Spain falls mainly on the plain.

   This is what was read in: 
"The rain in Spain falls mainly on the plain."

Notice the complete sentence was read in even though the field width was set to only 10 characters.

To limit the amount of input, we must resort to the std::istream::getline():

// -*- mode: c++ -*-
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <string>

using namespace std;

int main(int /*argc*/, char* argv[]) {
  int returnCode = EXIT_FAILURE;
  constexpr auto MAXINPUTLIMIT = 10U;

  char buffer[MAXINPUTLIMIT+1];
  memset(buffer,  0,  sizeof(buffer));

  try {
    cout << "String getline input? " << flush;
    cin.getline(buffer, sizeof(buffer));

    cout << endl << " This is what was read in: " << endl;
    cout << "\"" << buffer<< "\"" << endl;
  
   returnCode = EXIT_SUCCESS;
  } catch (const exception& ex) {
    cerr << argv[0] << ": Exception: " << ex.what() << endl;
  }
  return returnCode;
}

And its sample use:

String getline input? I have met the enemy and thems is us.

 This is what was read in: 
"I have met"

Notice the code only asks for 10 characters and it only gets 10 characters. I used a plain old C char array rather than a fancier C++ std::array<char, 10> because char doesn’t have a constructor, so the values of a default-constructed array are indeterminate. An easy way to make sure a C-style string is null terminated is to fill it with 0 using memset(). Of course, you could fill the array with fill() from <algorithm>, but sometimes the more direct method is lighter, faster, and more secure.
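That said, value-initializing a std::array gives the same zeroed, null-terminated buffer without the memset. A small sketch, using the same limit as above:

#include <array>
#include <iostream>

int main() {
    constexpr auto MAXINPUTLIMIT = 10U;
    std::array<char, MAXINPUTLIMIT + 1> buffer{};   // the {} zero-fills every element

    std::cin.getline(buffer.data(), buffer.size());
    std::cout << "\"" << buffer.data() << "\"" << std::endl;
    return 0;
}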

Global Warming and Java

Mile long iceberg

I’ve been losing the Java versus C++ argument for a long time.  Just look at the latest Tiobe Index. Even more disturbing are the languages in most demand in job listings.

Now I beg you, for the sake of the planet, to reconsider your use of Java for your next project.  Just consider how much electricity is spent globally on computers.  Right now probably about 10% of the world’s energy goes to powering our computers (see “IT Electricity Use Worse than You Thought”). Some expect IT energy use to grow to 20% of mankind’s energy use by 2025.

One of the most bizarre arguments I’ve heard for the use of Java is that it is as fast as C/C++.  If you consider half to a quarter of the speed of C/C++ to be the same, or fail to consider that you pay for Java’s “Just-in-Time” compilation every time you run the application, then that argument is correct. Now consider what that slowness means.

(Look at https://benchmarksgame-team.pages.debian.net/benchmarksgame/which-programs-are-fast.html)

Some Numbers

(from https://en.wikipedia.org/wiki/World_energy_consumption)

10% of the world’s energy comes to about 15.75 petawatt-hours/year.  Your typical computer consumes about 120 watts.  A typical CPU takes about 85 watts, with the remainder consumed by memory, drives, and fans. In my calculations I’m not going to count the extra power needed to cool the machine, because many machines merely vent their heat to the environment. 100% of the computer’s power consumption exhausts as heat; a computer converts no energy to mechanical energy.

According to the Tiobe index, Java powers 16% of the IT world. Let’s be generous and assume Java is only half as fast as C/C++; that means 8% × 15.75 petawatt-hours/year ≈ 1.26 petawatt-hours could have been saved per year. That’s about 1,800 million metric tons of carbon dioxide.

What to do next

Despite the evidence, I doubt we can unite the world governments in banning Java. We can, though, wisely choose our next implementation language.

A long, long time ago, seemingly in a galaxy far, far away, I wrote my first program in Fortran 4 using punch cards.  Imagine my surprise when, 40 years later, a major company sought me out to help them with their new Fortran code.  They were still using Fortran because their simulations and analytics didn’t cost much to run in the cloud.  Besides being the language the engineers knew and loved, it was an order of magnitude less expensive to run than other languages.

They had me convert much of the code to C/C++.  The C/C++ was not quite as fast as the Fortran, because C/C++ allows pointers and aliasing, so the compiler can’t make the same assumptions as Fortran. Modern Fortran has a lot of new features to make it friendlier to engineers than the old Fortran 4, but frankly, it was a little like putting lipstick on a pig. Object orientation and extensible code are just a little difficult in Fortran.

Looking back at the benchmark game (previously mentioned https://benchmarksgame-team.pages.debian.net/benchmarksgame/which-programs-are-fast.html), some benchmarks actually ran faster written in C/C++ than in Fortran.  In fact, Fortran only ranked #5.

Looking at the C/C++ programs, though, many of them gained their speed through the use of inline assembly, so the benchmark game isn’t a fair measure. The #2 entry, though, Rust, shows something quite different.  The same applications written in Rust almost matched the C/C++ speed — but the Rust applications used no inline assembly and were written in native Rust. The Rust applications looked elegant and clear, and at the time I didn’t even know Rust.

Every language teaches you a new way to solve problems.  Modern C++ is beginning to behave like Fortran in its accretion of features. You can use any programming paradigm in C++, and unfortunately programmers choose to do so, but you need to be a language lawyer to effectively write and use the features. In C++ it is easy to write not just unsafe code, but broken code.  For example:

char *imbroken(std::string astring) { return astring.data(); }

If you’re lucky your compiler will flag such monstrosities.

Rust, though, only allows you to “borrow” pointers, and you can’t pass them out of scope.  Rust forces you to account for every value a function returns.  Rust doesn’t have exceptions, but that’s intentional.  You must check every error return.  Take a look at it: https://www.rust-lang.org/

The new C++ is wonderful, but I’m losing my patience with its intricate details.  I’m finding Rust frustrating, but I find it reassuring that if it compiles, I’m miles ahead in having confidence that it is correct, secure, and fast. There is even a project to bring Rust into the Linux kernel.

Offside!

Shipboard

 

A friend from college found my blog, and to my delight made some suggestions. I had to promise, though, to include a diatribe against “offside-rule” languages, scripting, and automatic memory allocation. I may never again get a job writing Python or Go applications, but here I go…

Offside-rule languages, such as Python and F#, use whitespace indentation to delimit blocks of statements. It’s a nice, clean syntax and a maintenance nightmare. I would have suffered less in my life without the hours spent deciphering the change in logic caused by cutting and pasting code between different indentation levels.  It’s especially bad when you’re trying to find the change in logic that someone else induced with their indentation error.

Taking it to an extreme, the humorists Edwin Brady and Chris Morris at the University of Durham created the language Whitespace (https://en.wikipedia.org/wiki/Whitespace_(programming_language)); the Wikipedia page is prettier than the official page, which only seems to be available through the Wayback Machine (http://archive.org/web/).

For full disclosure, I do use Python when I’m playing around with Project Euler (https://projecteuler.net/). It is the ideal language for quick number theory problems.  In a professional context, though, Python has proven to be a nightmare, starting with the interpreter crashing with segmentation faults on what I thought were simple constructs, and continuing with the lack of asynchronous and multi-threaded features (try implementing an interactive read with a timeout, or fetching both the standard and error output from a child process).  Complete the nightmare with a lack of compatibility between Python releases.

How To Get a Legacy Project Under Test.

You’re smart, so I’ll just give the outline and let you fill in the blanks:

0.  Given: you have a project of 300K to millions of lines of code largely without tests.

1.  Look at your source control and find the areas undergoing the most change.  Use StatSVN’s heatmap with Subversion.  With Perforce, just look at the revision numbers of the files to detect the files undergoing the most change. With git, use gource or StatGit.  The areas under the most change are the areas you want to refactor first.

2.  In your chosen area of code, look at the dependencies.  Go to the leaves of the dependency tree of just that section of code.  Create mock function replacements for system functions and other external APIs, like databases and file i/o, that the leaf routines use (a small sketch of such a mock follows this list).

3.  Even at this level, you’ll find circular dependencies and compilation units dependent on dozens of header files and libraries.  Create dummy replacements for some of your headers that aren’t essential to your test.  Use macro definitions to replace functions — use every trick in the book to get just what you want under test.   Notice that so far you haven’t actually changed any of the code you’re supposed to fix.  You may spend a week or weeks to get to this point, depending on the spaghetti factor of the code.  Compromise a little — for example, don’t worry about how to simulate an out-of-memory condition at first.  Hopefully you’ll reach a critical mass where it gets easier and easier to write tests against your code base.

4.  Now you get to refactor.   Follow the Law of Demeter.  Avoid “train wrecks” of expressions where you use more than one dot or arrow to get at something.  Don’t pass a whole object when all the callee needs is one member.    This step will change the interfaces of your leaf routines, so you’ll need to go up one level in the dependency tree and refactor that — so rinse and repeat at step 3.

5.  At each step in the process, keep adding to your testing infrastructure.  Use coverage analysis to work towards 100% s-path coverage (not just lines or functions).  Accept that you’re not going to get everything at first.
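
Here is a small sketch of the kind of mock replacement step 2 calls for. The interface and names are hypothetical, but the shape is typical: put a thin seam in front of the external resource, then substitute a canned implementation in the test so the leaf routine never touches a real file.

#include <cassert>
#include <string>

// The seam: production code depends on this interface, not on open()/read() directly.
struct FileReader {
    virtual ~FileReader() = default;
    virtual std::string readAll(const std::string& path) = 0;
};

// Leaf routine under test.
int countLines(FileReader& reader, const std::string& path) {
    int lines = 0;
    for (char c : reader.readAll(path))
        if (c == '\n') ++lines;
    return lines;
}

// Test double: no file I/O, just canned data.
struct FakeReader : FileReader {
    std::string readAll(const std::string&) override { return "alpha\nbeta\ngamma\n"; }
};

int main() {
    FakeReader fake;
    assert(countLines(fake, "ignored.txt") == 3);
    return 0;
}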

What does this buy you?    You can now add features and modify the code with impunity because you have tests for that code.  You’ll find the rate of change due to bug fixes disappears to be replaced with changes for new salable features.

On the couple of projects where I applied this methodology the customer escalation rate due to bugs  went from thousands a month to zero.  I have never seen a bug submitted against code covered with unit tests.

The Encryption Wars aren’t Over Yet

Remember the Clipper Chip? It was Al Gore’s approved encryption chip that the government wanted to insert into every digital communications device that would allow the government to eavesdrop on criminals and everyone else’s conversations with a court order. The Clipper Chip finally faded away because of lack of public adoption and the rise of other types of encryption not under government control.  We never did resolve the debate on whether the government should even be trying to do that sort of eavesdropping.

Sink Clipper campaign

Now the government is back at it again. The Burr-Feinstein Bill (https://assets.documentcloud.org/documents/2797124/Burr-Feinstein-Encryption-Bill-Discussion-Draft.pdf) proposes to criminalize people like me who refuse to aid the government in hacking into a phone.  Australia, the United Kingdom, Canada, and other countries already have similar laws.  The UK has already sentenced several people to prison for not revealing encryption keys.

Fortunately at the moment the information locked inside my own head is not accessible to the government or organized criminals.  Once I write some notes down on my tablet, though, even though my tablet is encrypted, the government can force someone else to hack my tablet.   If my own government can do it, then presumably organized crime and foreign governments can also do it.  In the aforementioned countries, they don’t even need to hack.  They will send me to jail if I fail to reveal my encryption keys.

Now as I am not a dissident nor a cybercriminal, I don’t really have much to worry from the government — but I do buy things online, and I do some banking online.  I also sometimes negotiate for contracts with the government.  In other words, I have lots of legitimate information I want to keep private, even from the government — and that’s on a good day.  Imagine the problems I would have if I were a dissident (such as a Republican GM car dealer).

If the government actually acted responsibly all of the time, perhaps we wouldn’t have much to worry about.  We live in a harsher world than that, though.  A small minority of officials are corrupt, and in addition, cybercriminals, terrorist organizations, and foreign agencies will attempt to exploit the same loopholes our government has coerced.

The U.S. position will have consequences.  Nations that value privacy and the rights of their citizens will refuse to do cyber business with U.S. companies, and the beacon of democracy will shine from some other shore.  Our economy will begin to revert to pre-internet days as people lose more trust in the net.  If the government can break into your phone, then a well-heeled terrorist organization can break into a power plant operator’s phone, steal his keys, and gain control of the power plant.  That’s just one example.

Compromise is not possible.  The problem is too big.  If you make a phone with a backdoor, then all phones of the same model and version are equally vulnerable.  No one will buy a U.S. designed phone.  If you break into one, then you can break into them all.

Given anyone with a little sense of operational security is not going to put anything on a phone more sensitive than a grocery list, any claim a phone might have value in an investigation is just a fishing expedition. Even if the phone belongs to a terrorist or a child pornographer, we must treat it as a brick. Breaking into a phone renders at least that version of the phone vulnerable for everybody with the same type of phone.

Everyone should e-mail Senators Feinstein and Burr and tell them that the new encryption laws compromise our freedoms.  This is so serious that this law places us on the edge of a new Dark Age.  I mourn that the United States is the agent of this dimming of the light of liberty.

Everyone needs to get their own encryption key.  Don’t depend on the one in your phone or tablet.  Comodo.com offers free e-mail certificates.  Of course, Comodo is generating the private key, so if the government coerces them to save the key it’s actually worse than having no key, but it is a start.  Just get started on your own encryption and signing.  If everyone digitally signs their e-mail, then it’s easy to filter spam.

Graduate to the next level and generate your own PGP key, and upload it to one of the public key servers.  You’ll need to get an e-mail client that understands PGP keys but you’ll have absolute security.  I use Mynigma on a Mac.  Get it from the Apple Appstore. Get started in this and learn about PGP keys before your government makes it illegal.

I wanted this to be a coding blog, but this encryption issue is one of the most important technical issues of our entire civilization.  As a coder, you can do your utmost to

  •  Write secure code.  Know the CERT coding guidelines.
    You can’t add security after the fact.  Firewalls, WAFs and the like are just security theater.
  • Always use a secure protocol on external interfaces.
  • Sign your code.
  • Sign your email.
  • Encrypt your storage.

Everyone tests. Test everything. Use unit tests.

Over the past 40 years I’ve noted that every project with a large QA staff was a project in trouble. Developers wrote code and tossed it over the fence for QA to test. QA would find thousands of defects and the developers would fix hundreds. We shipped with hundreds of known defects. After a few years the bug database would have tens of thousands of open bugs — which no one had time to go over to determine if they were still relevant. The bug database was a graveyard.

Fortunately I’ve had the joy and privilege of working on a few projects where everyone tests. I think those projects saved my sanity. At least I think I’m sane. In those test-oriented projects we still had a small QA department, but largely they checked that we did the tests, and sometimes they built the infrastructure for the rest of us to use in writing our own tests. Probably even more importantly, the QA people were treated as first-class engineers, reinforced by having every engineer periodically take a turn in QA. In those test-oriented projects we detected even more bugs than on the big-QA-department projects, but shipped with only a handful of really minor bugs. By minor, I mean of the type where someone objected to a blue-colored button, but we didn’t want to spend the effort to make the button color configurable. Because the developers detected the bugs as they wrote the code, they fixed the bugs as they occurred. Instead of tens of thousands of open bugs, we had a half dozen open bugs.

Testing as close as possible to writing of the code, using the tests to help you write the code, is much more effective than the classic throw it over the fence to the QA department style. On projects with hundreds of thousands of lines of code, the large QA departments generally run a backlog of tens of thousands of defects, while the test-driven projects with the same size code base, run a backlog of a couple of bugs.

This observation deserves its own rule of thumb:

A project with a large QA department is a project in trouble.

Almost everyone has heard of test-driven development, but few actually understand unit tests. A unit test isn’t just a test of a small section of code — you use a unit test while you write the code. As such, it won’t have access to the files, network, or databases of the production or test systems. Your unit tests probably won’t even have access to many of the libraries that other developers are writing concurrently with your module. A classic unit test runs just after you compile and link, with just what you have on your development machine.

This means that if your module makes reference to a file or data database or anything else that isn’t in your development environment, you’ll need to provide a substitute.

If you’re writing code from scratch, getting everything under test is easy. Just obey the Law of Demeter (http://www.ccs.neu.edu/home/lieber/LoD.html). The Law of Demeter, aka the single-dot rule, aka the Principle of Least Knowledge, helps ensure that the module you’re writing behaves well in changing contexts. You can pull it out of its current context and use it elsewhere. Just as important, it doesn’t matter what the rest of the application is doing (unless the application just stomps on your module’s memory); your module will still behave correctly.

The Law of Demeter says that a method or function of a class may only refer to variables and functions defined within the function itself, to members of its own class or superclass, or to objects passed in through its argument list. This gives you a built-in advantage in managing your dependencies. Everything your function needs can be replaced, so writing unit tests becomes easy.

Take a look at these example classes:

// Minimal declarations of the helper classes used in the example.
class Aardvark {
public:
    unsigned int legs() const;
};

class Animals {
public:
    const Aardvark& anAardvark() const;
};

class ExampleParent {
protected:
    void methodFromParentClass(const char *arg);
};

class ExampleClass : public ExampleParent {
public:
    void method(const char *arg, const Animals &animal);

    std::ostream& method(std::ostream& outy, const char *arg, unsigned int legs);
};

Now take a look at this code that violates the Law of Demeter:

void ExampleClass::method(const char *arg, const Animals &animal) {
    unsigned int locallyOwned = 2;

    std::cout << arg << std::endl;          // bad: a global object the method reached out for

    if (animal.anAardvark().legs() != 4)    // bad: reaches through Animals to get at Aardvark
        methodFromParentClass(arg);         // okay: inherited from the superclass

    // Another attempt to do the same thing,
    // but the violation of data isolation is still present.
    const Aardvark &aardvark = animal.anAardvark();
    if (aardvark.legs() != 4)               // still bad
        methodFromParentClass(arg);         // okay

    locallyOwned += 42;                     // okay: a local variable

    // ...
}

The primary problem is that if Animals is an object that refers to external resources, the mock object that replaces it in a unit test must also replicate the Aardvark class. More importantly, in program-maintenance terms, you’ve created a dependency on Animals when all you need is Aardvark. If Animals changes you may need to modify this routine, even though Aardvark is unchanged. There is a reason a reference with more than one dot or arrow is called a train wreck.
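
One way out, and perhaps the reason for the second overload in the example class, is to pass in only what the method actually uses: the output stream and the leg count.  Here’s a sketch of that version; now a unit test needs no mock at all, just a std::ostringstream and an integer.

std::ostream& ExampleClass::method(std::ostream &outy, const char *arg, unsigned int legs) {
    unsigned int locallyOwned = 2;

    outy << arg << std::endl;          // okay: the stream came in through the argument list

    if (legs != 4)                     // okay: a plain value from the argument list
        methodFromParentClass(arg);    // okay: inherited from the superclass

    locallyOwned += 42;                // okay: a local variable

    // ...
    return outy;
}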

Of course, for every rule there are exceptions. Robert “Uncle Bob” C. Martin, in Clean Code (http://www.goodreads.com/book/show/3735293-clean-code), differentiates between plain old structs and objects. Structs may contain other structs, so it seems an unnecessary complication to try to avoid more than one dot. I can see the point, but when I’m reading code, unless I have the header handy, I don’t necessarily know whether I’m looking at a reference to a struct or a class. I compromise: if a struct is only ever going to be used in a C-like primitive fashion, I declare it as a struct; if I add a function or constructor, I change the declaration to a class and add the appropriate public, private, and protected access attributes.

It’s been too long since my last post. In lieu of a coding joke, I’m including a link to my own C++ Unit Extra Lite Testing framework: https://github.com/gsdayton98/CppUnitXLite.git.

To get it, do a

      git clone https://github.com/gsdayton98/CppUnitXLite.git
    

For a simple test program, just include CppUnitXLite/CppUnitLite.cpp (that’s right, include the C++ source file, because it contains the main program test driver). Read the comments in the header file for suggestions on its use. Notice there is no library, no Google “pump” to generate source code, and no Python or Perl needed. Have fun, and please leave me some comments and suggestions. If you don’t like the framework, tell me; I might learn something from you. Besides, I’m a big boy, I can take a little criticism.

Woodpecker Apocalypse

Weinberg’s woodpecker is here, as in the woodpecker in “If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization” (Gerald M. Weinberg, The Psychology of Computer Programming, 1971).

We’ve put our finances, health information, and private thoughts on-line, entrusting them to software written in ignorance.  Hackers exploit the flaws in that software to get at your bank accounts, credit cards, and other personal information.  We protect it all behind passwords with arbitrary strength rules that we humans must remember.  Humans write the software that accepts your passwords and other input.  Now comes the woodpecker part.

Being trusting souls, we’ve written our applications not to check their inputs, depending on the user not to enter too much.  Being human, we habitually write programs with buffer overruns, accept tainted input, and divide by zero.  We write crappy software.  Heartbleed and Shellshock and a myriad of other exploits use defects in software to work their evil.

Security “experts”, who make their money by making you feel insecure, tell you it’s impossible to write perfect software.  Balderdash.  You can write small units, and you can exercise every pathway in a small unit.  You have a computer, after all: use it to enumerate the code paths and then use it to generate test cases.  It is possible to exercise every path through a small unit.  Making the small units robust makes it easier to isolate what’s going wrong in the larger system.  If you have two units that are completely tested, so you know they behave reasonably no matter what garbage is thrown at them, then testing the combination is sometimes redundant.  Testing software doesn’t need to be combinatorially explosive.  If you test every path in module A and every path in module B, you don’t need to test the combination, except when the modules share resources (the evilness of promiscuous sharing is another topic).  Besides, even if we couldn’t write perfect software, that wouldn’t mean we shouldn’t try.
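
Here’s a small illustration of what exercising every path means in practice.  The clampToRange function is invented for the example; it has exactly three paths, so three test cases cover them all, and a short loop lets the computer sweep a whole range of inputs against an invariant that must hold on every path.

#include <cassert>

// A small unit with exactly three paths: below the range, above it, inside it.
int clampToRange(int value, int low, int high)
{
    if (value < low)  return low;
    if (value > high) return high;
    return value;
}

int main()
{
    // One test per path.
    assert(clampToRange(-5, 0, 10) == 0);    // path 1: below the range
    assert(clampToRange(50, 0, 10) == 10);   // path 2: above the range
    assert(clampToRange( 7, 0, 10) == 7);    // path 3: inside the range

    // Let the computer generate more cases: sweep a range of inputs and
    // check an invariant that must hold no matter which path was taken.
    for (int v = -100; v <= 100; ++v) {
        int result = clampToRange(v, 0, 10);
        assert(result >= 0 && result <= 10);
    }
    return 0;
}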

Barriers to quality are a matter of imagination rather than fact.  How many times have you heard a manager say that spending the time or buying the tool costs too much, even though we’ve known since the 1970s that bugs caught at the developer’s desk cost ten times less than bugs caught later?  The interest on the technical debt is usury.  This suggests we can spend a lot more money up front on quality processes, avoid technical debt, and come out money ahead in the long run.  Bern and Schieber did their study in the 1970s.  I found this related NIST report from 2000:

NIST Report

The Prescription, The Program, The Seven Steps

Programmers cherish their step zeroes.  In this case, step zero is just making the decision to do something about quality.  You’re reading this, so I hope you’ve already made the decision; just in case, though, let’s list the benefits of a quality process:

  • Avoid the re-work of bugs.  A bug means you need to diagnose, test, reverse-engineer, and go over old code.  A bug is a manifestation of technical debt.  If you don’t invest in writing and running the tests up front, you are incurring technical debt at 1000% interest.
  • Provide guarantees of security to your customers.  Maybe you can’t stop all security threats, but at least you can tell your customers what you did to prevent the known ones.
  • Writing code with tests is faster than writing code without.  Beware of studies that largely use college-student programmers, but studies show that programmers using test driven development are about 15% more productive.  That doesn’t count the time the organization isn’t spending on bugs.
  • Avoid organizational death.  I use a rule of thumb about the amount of bug fixing an organization does.  I call it the “Rule of the Graveyard Spiral”.  In my experience any organization spending more than half of its time fixing bugs has less than two years to live, which is about the time the customers, or sponsoring management lose patience and cut-off the organization.

So, let’s assume you have made the decision to get with the program and do something about quality.  It’s not complicated.  A relatively simple series of steps instills quality and forestalls the build-up of technical debt in your program.  Here’s a simple list:

  1. Capture requirements with tests.  Write a little documentation.
  2. Everyone tests.  Test everything.  Use unit tests.
  3. Use coverage analysis to ensure the tests cover enough.
  4. Have someone else review your code. Have a coding standard.
  5. Check your code into a branch with an equivalent level of testing.
  6. When merging branches, run the tests.  Branch merges are test events.
  7. Don’t cherish bugs.  Every bug has a right to a speedy trial.  Commit to fixing them or close them.

Bear in mind that implementing this process on your own is different from persuading an organization to adopt it.  Generally, if a process makes a person’s job easier, they will follow it.  The learning curve on a test-driven process can be steeper than you expect, because you must design a module, class, or function to be testable.  More on that later.

On top of that, you need to persuade the organization that writing twice as much code (the tests plus the functional code) is actually faster than writing just the code and testing later.  In most organizations, though, nothing succeeds like success.  In my personal experience, developers who learned to write testable code and wrote unit tests never go back to the old way of doing things.  On multiple occasions, putting legacy code that was causing customer escalations under unit test eliminated all customer escalations.  Zero is a great number for the number of bugs.

Details

  1. Capture requirements with tests.

Good requirements are quantifiable and testable.  You know you have a good requirement when you can build an automated test for it, so capture your requirements in tests.  For tests of GUI behavior, use a tool like Sikuli (http://www.sikuli.org/).  If you’re testing boot-time behavior, use a KVM switch and a second machine to capture the boot screens.  Be very reluctant to accept a manual test; be very sure the test can’t be automated first.  Remember, the next developer who deals with your code may not be as diligent as you, so manual tests become less likely to be re-run when the code is modified.
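
As a small sketch of a requirement captured as a test, suppose the requirement reads “an account is locked after three failed login attempts”.  The AccountLock class below is invented for the example; the point is that the requirement becomes an executable check instead of a sentence in a document.

#include <cassert>

// Hypothetical class implementing the requirement:
// "An account is locked after three failed login attempts."
class AccountLock {
public:
    AccountLock() : failures_(0) {}
    void recordFailure() { ++failures_; }
    bool locked() const  { return failures_ >= 3; }
private:
    unsigned int failures_;
};

int main()
{
    AccountLock lock;
    lock.recordFailure();
    lock.recordFailure();
    assert(!lock.locked());   // two failures: still open
    lock.recordFailure();
    assert(lock.locked());    // third failure: the requirement says locked
    return 0;
}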


Closely related to capturing your requirements in tests is documenting your code.  Documentation is tough.  Whenever you write two related things in two different places, the two will drift out of sync, and each becomes obsolete relative to the other.

It might as well be a law of configuration management:  Any collection residing in two or more places will diverge.

So put the documentation and the code in the same place.  Use doxygen (http://www.stack.nl/~dimitri/doxygen/).  Make your code self-documenting.  Pay attention to the block of documentation at the top of the file, where you can describe how the pieces work together.  On complicated systems, bite the bullet and provide an external file that describes how it all works together.  The documentation in the code tends to deal only with that code and not its neighbors, so spend some time describing how they work together.  Relations are important.
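
For example, a doxygen comment block sits right next to the code it describes, so it gets updated in the same edit as the code.  The declarations below are just an illustration; the \file, \brief, \param, and \return tags are standard doxygen markup.

/**
 * \file  rangeutil.hpp
 * \brief Small helpers for working with closed integer ranges.
 *
 * The file-level block is the place to describe how these helpers
 * relate to the rest of the system.
 */

/**
 * \brief  Clamp a value into the closed range [low, high].
 * \param  value  The value to clamp.
 * \param  low    Lower bound of the range.
 * \param  high   Upper bound of the range; must be >= low.
 * \return The nearest value inside [low, high].
 */
int clampToRange(int value, int low, int high);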

You need just enough external documentation to tell the next developer where to start.  I like to use a wiki for my projects.  As each new developer comes into the project I point them to the wiki, and I ask them to update the wiki where they had trouble due to incompleteness or obsolescence.  I’m rather partial to MediaWiki (https://www.mediawiki.org/wiki/MediaWiki).  For some reason other people like Confluence (http://www.atlassian.com/Confluence ).  Pick your own wiki at http://www.wikimatrix.org/ .

Don’t go overboard on documentation.  Too much means nobody will read it or maintain it, so it will quickly diverge until it has little relation to the code.  Documentation is part of the code: change one, change the other.

Steps 2 through 7 deserve their own posts.

I’m past due on introducing myself.  I’m Glen Dayton.  I wrote my first program, in FORTRAN, in 1972.  Thank you, Mr. McAfee.  Since then I’ve largely worked in aerospace, but then I moved to Silicon Valley to marry my wife and take my turn on the start-up merry-go-round.  Somewhere in the intervening time Saint Wayne V. introduced me to test driven development.  After family and friends, the most important thing I ever worked on was PGP.


Today’s coding joke is the Double-Checked Locking Pattern.  After all these years I still find people writing it.  Read about it and its evils at

C++ and the Perils of Double-Checked Locking

When you see the following code, software engineers will forgive you if you scream or laugh:

static Widget *ptr = NULL;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// ...
if (ptr == NULL)
{
    pthread_mutex_lock(&lock);
    if (ptr == NULL)
        ptr = new Widget;
    pthread_mutex_unlock(&lock);
}
return ptr;

One way to fix the code is to just use the lock.  Most modern mutex implementations make the uncontended case cheap (little more than an atomic operation, often with a brief spin before sleeping), so you don’t need to be shy about using them:

#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>

using boost::mutex;
using boost::lock_guard;

static Widget *ptr = NULL;
static mutex mtx;

//...

{
    lock_guard<mutex> lock(mtx);
    if (ptr == NULL)
       ptr = new Widget;
}
return ptr;

Another way, if you’re still shy about locks, is to use memory-ordering primitives.  C++11 offers atomic variables and memory-ordering primitives; the example below uses their Boost equivalents.

#include <boost/atomic/atomic.hpp>
#include <boost/memory_order.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>

class Widget
{
public:
  Widget();

  static Widget* instance();
private:
};
Widget*
Widget::instance()
{
  static boost::atomic<Widget *> s_pWidget(NULL);
  static boost::mutex s_mutex;

  Widget* tmp = s_pWidget.load(boost::memory_order_acquire);
  if (tmp == NULL)
  {
    boost::lock_guard<boost::mutex> lock(s_mutex);
    tmp = s_pWidget.load(boost::memory_order_relaxed);
    if (tmp == NULL) {
      tmp = new Widget();
      s_pWidget.store(tmp, boost::memory_order_release);
    }
  }
  return tmp;
}

If the check occurs in a high-traffic area, though, you may not want to pay for an atomic load and its memory ordering on every call, so use a thread-local variable for the check:

#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>

using boost::mutex;
using boost::lock_guard;

Widget*
Widget::instance()
{
    // __thread is a GCC extension; C++11 spells it thread_local.
    static __thread Widget *tlv_instance = NULL;
    static Widget *s_instance = NULL;
    static mutex s_mutex;

    if (tlv_instance == NULL)
    {
        lock_guard<mutex> lock(s_mutex);
        if (s_instance == NULL)
            s_instance = new Widget();
        tlv_instance = s_instance;
    }

    return tlv_instance;
}

Of course, everything is a trade-off.  A thread-local variable is sometimes implemented as an index into an array of values allocated for the thread, so it can be expensive.  Your mileage may vary.

Software Sermon

I’ve been accused of preaching when it comes to software process and quality, so I decided to own it — thus the name of my blog.

Our world is at a crossroads with ubiquitous surveillance and criminals exploiting the flaws in our software. The two issues go hand-in-hand.  Insecure software allows governments and criminal organizations to break into your computer, and use your computer to spy on you and others.  A lot of people think they don’t need to care because they’re too innocuous for government notice, and they don’t have enough for a criminal to bother stealing.

The problem is that everyone with an online presence, and everyone with an opinion, has something to protect.  Thieves want to garner enough of your personal information to steal your credit.  Many people bank online, access their health records online, and display their social lives online.  Every government, including our own, has at one time or another suppressed what it considered dissident speech.

So let’s talk about encrypting everything, and making the encryption convenient and powerful.  Before we get there, though, we have to talk about not writing crappy software.  All the security in the world does no good if you have a broken window.

My favorite language happens to be C++, so I’ll mostly show examples from that language.  Just to show the problems translate into other languages, I’ll occasionally toss in an example in Java.  I promise I will devote an entire future posting to why I hate Java, and provide the code to bring a Java server to its knees in less than 30 seconds.  With every post I’ll try to include a little code.


Today’s little code snippet is about the use of booleans.  It actually has nothing to do with security, and everything to do with me learning how to blog.  I hate it when I encounter coding jokes like

if (boolVariable == true || anotherBool == false) ...

It’s obvious that the author of that line didn’t understand the evaluation of booleans.  When I asked about the line, the author claimed, “It’s more readable that way.”  Do me and other rational people a favor: when creating a coding guideline or standard, never, ever justify it with “it’s more readable that way”.  Beauty is in the eye of the beholder, and many programmers actually expect idiomatic use of the language.  Know the language before claiming one form is less readable than another.  In this particular case, the offending line defies logic.  What is the difference between

boolVariable == true

and

boolVariable == true == true == true ...

Cut to the chase and just write the expression as

if (boolVariable || ! anotherBool) ...

Believe it or not (try it yourself by compiling with assembly output), the different styles make a difference in the generated code.  In debug mode, the Clang and GNU compilers generate an actual test of a word against zero for the redundant comparison.  Thankfully, the optimizing compilers yield the same code for either form.  It is helpful, though, to have the debug code close to the optimized code.


The above coding joke is related to using a conditional statement to set a boolean, for example:

if (aardvark > 5) boolVariable = true;

The basic problem here is that you don’t know whether the programmer actually meant  boolVariable = aardvark > 5;  or whether they meant

 boolVariable = boolVariable || aardvark > 5;

Write what you mean.