Harshdeep 2.0

June 21, 2007

Reusing High Level Modules – Dependency Inversion Principle

Filed under: Design Patterns, Dev, Geek, Programming — harshdeep @ 8:14 am

It is easier to make a low-level module reusable than a high-level module. Firstly, a low-level module generally has clearer goals (do one thing and do it right) and wider usability (the number of people who need a generic stack is much greater than the number who need a document indexer). Secondly, a low-level module has fewer dependencies on other modules. When one moves a module from one application to another, one also needs to move all the modules that it depends on and, since dependency is transitive, also the modules that those modules depend on, and so on. So, the higher the efferent coupling of a module, the harder it is to reuse.

Note that we are talking about reusing a module and not just copying chunks of code. Code copying is not code reuse.

code copying … comes with a serious disadvantage: you own the code you copy! If it doesn’t work in your environment, you have to change it. If there are bugs in the code, you have to fix them. If the original author finds some bugs in the code and fixes them, you have to find this out, and you have to figure out how to make the changes in your own copy. Eventually the code you copied diverges so much from the original that it can hardly be recognized. The code is yours. While code copying can make it easier to do some initial development; it does not help very much with the most expensive phase of the software lifecycle, maintenance.

I prefer to define reuse as follows. I reuse code if, and only if, I never need to look at the source code (other than the public portions of header files). I need only link with static libraries or include dynamic libraries. Whenever these libraries are fixed or enhanced, I receive a new version which I can then integrate into my system when opportunity allows.

Now, to make my high-level component reusable, I need to remove its dependencies on low-level modules. This is one of the motivations behind the Dependency Inversion Principle, put forward by Robert C. Martin in another of his brilliant papers on Design Patterns.

Consider the implications of high level modules that depend upon low level modules. It is the high level modules that contain the important policy decisions and business models of an application. It is these models that contain the identity of the application. Yet, when these modules depend upon the lower level modules, then changes to the lower level modules can have direct effects upon them; and can force them to change.

This predicament is absurd! It is the high level modules that ought to be forcing the low level modules to change. It is the high level modules that should take precedence over the lower level modules. High level modules simply should not depend upon low level modules in any way.

Moreover, it is high level modules that we want to be able to reuse. We are already quite good at reusing low level modules in the form of subroutine libraries. When high level modules depend upon low level modules, it becomes very difficult to reuse those high level modules in different contexts. However, when the high level modules are independent of the low level modules, then the high level modules can be reused quite simply.

He defines the Dependency Inversion Principle as

a) High level modules should not depend upon low level modules. Both should depend upon abstractions.

b) Abstractions should not depend upon details. Details should depend upon abstractions.

    Here’s an example from the same paper. In the traditional layered design, where each layer depends directly on the one beneath it, a change in the lowest-level Utility Layer can affect the highest-level Policy Layer.

    Instead of letting each layer depend directly on the one underneath it, I can make each of the higher level layers use the lower layer through an interface (abstract class) that the actual layer implements (derives from).

    Now none of the higher-level layers will be affected if any of the lower-level layers change, as long as they keep abiding by their respective interfaces. If I switch to a third-party library for any of the lower-level layers, I can write an Adapter to make it conform to its interface, thereby not affecting the higher-level layer at all.
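A minimal C++ sketch of this arrangement follows. All names here (`Logger`, `OrderPolicy`, `ThirdPartySink`, `ThirdPartyLoggerAdapter`) are hypothetical and do not come from the paper; the point is only that the high-level class sees an abstract class, and an Adapter makes an unrelated library conform to it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Abstraction that the high-level layer depends on (hypothetical name).
class Logger {
public:
    virtual ~Logger() = default;
    virtual void write(const std::string& message) = 0;
};

// High-level policy: knows only the Logger interface, not any concrete layer.
class OrderPolicy {
public:
    explicit OrderPolicy(Logger& logger) : logger_(logger) {}

    bool approve(int amount) {
        assert(amount >= 0);  // precondition of this toy policy
        const bool ok = amount <= 1000;
        logger_.write(ok ? "approved" : "rejected");
        return ok;
    }

private:
    Logger& logger_;
};

// Stand-in for a third-party library with its own, incompatible API.
class ThirdPartySink {
public:
    void append_line(const char* text) { lines.push_back(text); }
    std::vector<std::string> lines;
};

// Adapter: makes the third-party class conform to the Logger interface.
class ThirdPartyLoggerAdapter : public Logger {
public:
    explicit ThirdPartyLoggerAdapter(ThirdPartySink& sink) : sink_(sink) {}
    void write(const std::string& message) override {
        sink_.append_line(message.c_str());
    }

private:
    ThirdPartySink& sink_;
};
```

Swapping `ThirdPartySink` for another library means writing a new adapter; `OrderPolicy` is neither edited nor recompiled against the new library's headers.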

    In many simple cases, dependency inversion can also be achieved through callbacks. A very common example is a module that provides APIs to let the application set its own memory allocation and de-allocation callbacks. The application may do this when it wants to use a heap optimized for small memory allocations, or when it wants to keep track of the total memory allocated.
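As a rough sketch of the callback approach, the C++ below shows a hypothetical low-level module (`imagelib`, an invented name) that lets the application install its own allocation routines. The tracking allocator is likewise only an illustration of the "keep track of memory" use case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Callback signatures exposed by the low-level module (hypothetical).
using AllocFn = void* (*)(std::size_t size);
using FreeFn = void (*)(void* ptr);

namespace imagelib {  // hypothetical low-level module

// The module defaults to the C runtime heap.
static AllocFn g_alloc = std::malloc;
static FreeFn g_free = std::free;

// The application calls this to substitute its own routines.
void set_allocator(AllocFn alloc_fn, FreeFn free_fn) {
    assert(alloc_fn != nullptr && free_fn != nullptr);
    g_alloc = alloc_fn;
    g_free = free_fn;
}

// Module code allocates only through the installed callbacks.
unsigned char* create_buffer(std::size_t size) {
    return static_cast<unsigned char*>(g_alloc(size));
}

void destroy_buffer(unsigned char* buffer) { g_free(buffer); }

}  // namespace imagelib

// Application side: an allocator that keeps simple statistics.
static std::size_t g_bytes_requested = 0;
static std::size_t g_live_blocks = 0;

void* counting_alloc(std::size_t size) {
    void* ptr = std::malloc(size);
    if (ptr != nullptr) {
        g_bytes_requested += size;
        ++g_live_blocks;
    }
    return ptr;
}

void counting_free(void* ptr) {
    if (ptr != nullptr) {
        --g_live_blocks;
    }
    std::free(ptr);
}
```

The application would call `imagelib::set_allocator(counting_alloc, counting_free)` once at startup, before the module makes its first allocation, so that every block is freed by the same routine family that allocated it.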

    However, I think there are cases when it’s alright if you don’t follow DIP.

    1. Lower-level module is highly stable. If you know that the lower-level module won’t change much during the life time of the depending module, and you are never going to have to replace it, even when the depending module is reused in another application, there is no harm in depending directly on it.
    2. Lower-level module is highly specific. Again, if you’ll never have to replace the lower-level module with another, you can depend directly on it.
    3. Performance is crucial. Calls through abstract classes and virtual functions carry a run-time cost (an indirect call that the compiler usually cannot inline), so they may not be advisable in the performance-critical parts of an application. However, one can consider using plain function callbacks to achieve dependency inversion in such cases, as in the allocation/de-allocation routine example above.

    June 6, 2007

    Why you should learn a new programming language

    Filed under: Dev — harshdeep @ 7:18 am

    As a computer scientist, one must avoid getting stuck with one programming language, even if it’s the “best programming language ever” and you can use it to do everything you’d ever want to. By constraining oneself to a particular language for too long, one starts confusing “what is” with “what should be” and “what can be”.

    I’ve been using C++ for the last 7 years. I had the (primitive) Borland Turbo C++ compiler on my first computer that dad gifted me when I entered NSIT. I fell in love with it immediately and, besides brief forays into Java and VB, I’ve mostly stuck to it since then.

    It’s a wonderful language. I was developing software for desktop and mobile platforms (no web development), and it was sufficient for all my needs. Some of my tasks might have been completed much faster if I had used, say, Perl or Python, but C++ is like a trusted friend whom I’d always approach first. Other options were considered if there was a very significant gain in terms of productivity – significant enough to warrant the learning curve.

    But I don’t think this is the right approach in the long run. I should have gone through the learning curve in many of those situations. An engineer must continuously learn new tools to make himself more productive. Scott Hanselman puts it more directly by saying that you should learn one new language every year.

    I’m now learning ActionScript for my current project which is a Flex/Apollo application. In spite of the suffix “script”, it is a very powerful and flexible language. I learn something interesting every day. Some of the stuff that you can do in ActionScript will never be supported in C++. This doesn’t mean that C++ is archaic compared to ActionScript or that ActionScript has insufficient or unnecessary features. It just means that the two languages are used for different purposes, and have different styles of getting things done.

    A fundamental difference between ActionScript and C++ is that ActionScript has prototype-based roots while C++ is purely class-based. A class in ActionScript is internally very different from one in C++. In C++, a class definition is just a compile-time blueprint for the objects. In ActionScript, the class is an object in itself at run time, and it carries a prototype object that its instances share. This is what the reference says:

    Every class definition is represented by a special class object that stores information about the class. Among the constituents of the class object are two traits objects and a prototype object. One traits object stores information about the static properties of the class. The other traits object stores information about the instance properties of the class and serves as the primary mechanism for class inheritance. The prototype object is a special object that can be used to share state among all instances of a class.

    Look at the XML handling in ActionScript. Thanks to the E4X support, you can use XML as part of your code…

    … and work with it in intuitive ways using little code. For example mXml..name gives a list of names of all employees, and mXml.employee.(age<30).name gives the list of younger ones only. Here is an introduction to E4X in ActionScript.

    There are several other interesting things about ActionScript that will be new to somebody born and brought up with C++. And it’s probably the same with other languages as well. So, even if you don’t have an immediate need to use a new language, I feel it’s worth taking some time out to get your hands dirty with one. It’s fun and chances are that you’ll pick up something fundamental in the process.

    May 14, 2007

    Thou Shalt Rebase Thy DLL

    Filed under: Dev — harshdeep @ 8:36 pm

    A friend promised me a treat today if I could find a way to tell the base address at which a DLL has been loaded for a given process. (Dragon’s Den in Sector 15-A, Noida serves great Chinese 🙂 )

    The short answer is – Use ListDLLs from Sysinternals.

    But thankfully, it took me some time before I could locate this nifty little tool. And while looking for the solution, I stumbled upon a lot of interesting information about DLL loading.

    For starters, this CodeProject article concisely explains the preferred base address of a DLL, and how it affects the load time.

    Every executable and DLL module has a preferred base address, which identifies the ideal memory address where the module should get mapped into a process’ address space. When you build an executable module, the linker sets the module’s preferred base address to 0x00400000. For a DLL module, the linker sets a preferred base address of 0x10000000.

    The default preferred base addresses mentioned here are for Microsoft’s VC++ linker. They will most probably be different with Borland and other compilers. You can use the Dependency Walker to check the preferred base address of a DLL/exe.

    When the DLL is built, the addresses of its functions and global/static variables are hard-coded relative to the preferred base address. This works as long as the DLL can be loaded at that address.

    But what if two DLLs used by your executable have the same base address?

    If your application needs to load a DLL whose preferred load address conflicts with memory that’s already in use (such as by a previously-loaded DLL that had the same preferred load address), the operating system “rebases” the conflicting DLL by loading it at a different address that does not overlap and then by adjusting all addresses. The physical format of a .dll file includes relocation information that points to, for example, the target addresses of CALL and JMP instructions, and addresses that reference global/static variables (such as literal strings). All these addresses have to get revised if the operating system cannot load the DLL at its preferred load address.
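The fixup step described above can be sketched in C++ as follows. This is a deliberately simplified model, not the real PE relocation format: the image is reduced to an array of 32-bit slots, and the relocation table to a list of slot indices that hold absolute addresses. The `Image` struct and `rebase` function are names invented for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model of a loaded module (not the real PE layout).
struct Image {
    std::uint32_t base;                    // address the slots are currently fixed up for
    std::vector<std::uint32_t> words;      // image contents as 32-bit slots
    std::vector<std::size_t> relocations;  // indices of slots holding absolute addresses
};

// Moves the image to actual_base by adjusting every relocated slot.
// Returns the number of fixups applied (0 if the image loads where it prefers).
std::size_t rebase(Image& image, std::uint32_t actual_base) {
    // Unsigned arithmetic wraps, so this also works when moving to a lower address.
    const std::uint32_t delta = actual_base - image.base;
    if (delta == 0) {
        return 0;  // loaded at the preferred address: nothing to patch
    }
    for (std::size_t index : image.relocations) {
        assert(index < image.words.size());
        image.words[index] += delta;  // same offset from the base, new absolute address
    }
    image.base = actual_base;
    return image.relocations.size();
}
```

Every slot touched by the loop is a write into the mapped image, which is why the number of fixups matters for both load time and memory use.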

    These address fixups slow down the loading of the DLL, and they carry a penalty in pagefile usage as well. An old but highly relevant article by Ruediger Asche, Rebasing Win32 DLLs: The Whole Story, explains this very lucidly.

    Whenever a page of the DLL is removed from an application’s working set, the operating system will reload that page from the DLL executable file the next time the page is accessed.

    Of course, when a DLL is rebased, this scheme no longer works because the pages that contain relocated addresses differ from the corresponding pages in the DLL executable image. Thus, as soon as the operating system attempts to fix up an address when loading an executable file, the corresponding page is copied (because the section was opened with the COPY_ON_WRITE flag), all the changes are made to the copy, and the operating system makes a note that from now on the page is to be swapped from and to the system pagefile instead of the executable image.

    There are two potential performance hits in this setup: First, each page that contains an address to be relocated takes up a page on the system pagefile (which will, in effect, reduce the amount of virtual memory available to all applications); and second, as the operating system performs the first fixup in a DLL’s page, a new page must be allocated from the pagefile, and the entire page is copied.

    So base address conflicts are evil. How do you avoid them?

    One way is to manually assign suitable preferred base addresses to all DLLs at build time using the /BASE linker option.

    You can even take the strict approach of building the DLL with the /FIXED flag. Now, if it can’t load at its preferred base address, it won’t load at all.

    But that’s not it. You can change the base address of a compiled DLL as well. The ReBaseImage function in Imagehlp.dll lets you do just that.

    Thiadmer has used this to calculate the base address of a DLL by hashing its name. This technique has a good probability of assigning non-conflicting base addresses to the DLLs, but they are still being rebased in isolation and there is no guarantee that their base addresses won’t conflict.
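To illustrate the general idea of deriving a base address from a name, here is a hedged C++ sketch. It is not Thiadmer's actual scheme: the FNV-1a hash, the address range starting at 0x60000000, and the 1 MB slot size are all assumptions chosen for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Maps a DLL file name to a candidate preferred base address.
// Range, slot size and hash function are illustrative assumptions.
std::uint32_t base_from_name(const std::string& dll_name) {
    assert(!dll_name.empty());

    constexpr std::uint32_t kRangeStart = 0x60000000u;
    constexpr std::uint32_t kSlotSize = 0x00100000u;  // 1 MB per slot
    constexpr std::uint32_t kSlotCount = 256;         // covers 0x60000000..0x6FFFFFFF

    // 32-bit FNV-1a over the lower-cased name, so "Foo.dll" and "FOO.DLL" agree.
    std::uint32_t hash = 2166136261u;
    for (unsigned char ch : dll_name) {
        hash ^= static_cast<std::uint32_t>(std::tolower(ch));
        hash *= 16777619u;
    }

    // Slot boundaries are multiples of 1 MB, hence also of 64 KB.
    return kRangeStart + (hash % kSlotCount) * kSlotSize;
}
```

With only 256 slots, two names can still hash to the same slot, and a DLL larger than one slot can spill into its neighbour, which is the weakness noted above: each DLL is placed without knowledge of the others.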

    This is where the EDITBIN utility provided by the Platform SDK comes in. You can use it (with /REBASE option) to rebase a set of DLLs to non-conflicting base addresses. It also uses the size of the DLLs to allot the base addresses, thereby ensuring an optimal distribution.
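The benefit of rebasing a set of DLLs together can be sketched as below. This is not EDITBIN's actual algorithm, only a minimal C++ illustration of size-aware, non-overlapping assignment; the `Module` struct and `assign_bases` function are hypothetical, and the 64 KB alignment reflects the Windows allocation granularity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Module {
    std::string name;
    std::uint32_t size;  // image size in bytes
    std::uint32_t base;  // assigned preferred base, filled in by assign_bases
};

constexpr std::uint32_t kGranularity = 0x10000u;  // 64 KB

// Rounds value up to the next multiple of 64 KB.
std::uint32_t round_up(std::uint32_t value) {
    return (value + kGranularity - 1) / kGranularity * kGranularity;
}

// Packs the modules one after another starting at first_base,
// so that no two address ranges overlap.
void assign_bases(std::vector<Module>& modules, std::uint32_t first_base) {
    assert(first_base % kGranularity == 0);
    std::uint32_t next = first_base;
    for (Module& module : modules) {
        module.base = next;
        next += round_up(module.size);
    }
}
```

Because each base is derived from the sizes of the modules placed before it, the ranges cannot collide, and no address space is wasted beyond the rounding to 64 KB.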

    There are a couple more things that you can do to make sure that your application fires up in no time. In his investigations on the costs of DLL loading, Ruediger Asche came up with some interesting conclusions.

    • All other things being equal, the size of the DLL does not matter; that is, the costs for loading a small DLL and a large DLL are pretty much equal. Thus, if possible, you should avoid writing a lot of small DLLs and instead write fewer large DLLs if load time is an issue for you. Note that this observation holds true over a very wide range of DLL sizes—when I ran the test on the huge binary DLL I mentioned earlier (the one with 15,000 pages), the load time did not differ very much from the load time for the small DLL that contains six pages total.
    • Rebasing the DLL incurs an overhead of about 600 percent on Windows NT and around 400 percent on Windows 95. Note, however, that this implies a great number of fixups (34,000 in the sample suite). For a typical DLL, the number is much smaller on the average; for example, in the debug version of MFC30D.DLL, which ships with Visual C++ version 2.x, there are about 1700 fixups, which is about 5 percent of the 34,000 fixups in the sample suite.
    • The single biggest factor that slows down the loading of DLLs is the location of the DLL. The documentation for LoadLibrary describes the algorithm that the operating system uses for locating the DLL image; a DLL located at the first search position (the current directory) loads in typically 20 percent or less of the time as the same DLL located deep down in the path loads. It is fairly obvious that the exact load time difference depends a lot on the length of the path, the efficiency of the underlying file system, and the number of files and directories that need to be searched.

    To wrap up the discussion, here is an interesting tidbit from The Old New Thing about how Windows 95 used to rebase its DLLs in the memory-starved conditions of the mid-90s.
