Coding, is? Fun!

Tuesday, August 12, 2008

Writing Technical Specifications

Training technical leads to write a technical specification document is challenging. In the process flow, a technical specification typically follows a functional specification from a business owner. I view a technical specification as a dialog between the technical lead and the development team (inward facing). Sometimes it is a dialog between the technical lead and a technical stakeholder in a client's organization (outward facing). In both cases, a tech lead needs training on how to go about writing the spec.
Here is a key attitude shift - please do not consider the spec as something you have to write to get the process documents complete. The spec writing process will organize your thought process, and create a narrative for the solution path you choose. It is similar to writing a story - in the structure and the narrative.
Before you start writing a spec you should be clear about the meaning of solution architecture. What do you focus on when you approach a system? At what detail? How important is UML for this purpose?

What is Technical Architecture?

Let us consider the Indian Railways Online Reservation System - http://irctc.co.in. The Railways are going to create an online reservation system and they have invited you as the Architect. What do you do (other than demanding free tickets for the rest of your life)?

There are different views of the system - there is an end state and then you have to get from the current state to the end state. Here are a few views of the system:

a. Users need to see the list of trains, see availability and then book the tickets. How do you get the list of trains and add new trains to the system? You need to see the list of free seats - how do you get that information in a database? Thus there is a data model that you have to create.
b. You have to design your middle tier code such that it can handle tens of thousands of users. That requires caching some of the information that you fetch every time - for example, the list of trains between two destinations can probably be cached. You have to come up with an object model. You also need to decide about state management (such as session).
c. Your UI has to be responsive and cool - you need minimum HTTP requests and need to deal with popups or overlays. You need cross-browser compatibility. These are frontend design concerns.
d. You have to manage a security scheme that handles different roles - some users may have admin capabilities to add additional routes or trains. You also have to protect your site against cross-site scripting attacks or SQL Injection attacks. So there is a security model.
e. This application must perform well for tens of thousands of users. That means you have to design for performance and consider upfront the performance tests on each module. You also have to plan the number of servers you will require for the website. This is the area of performance testing and capacity planning. You also have to plan for sharing state among different webservers. This is your deployment model.
f. You have to process payment by credit cards to different banks. This is the area of payment gateway integration. You may need to support multiple banks through a similar interface. It must be easy to add another bank. You also have to make sure that your data is secure (encrypted) if you comply with PCI or other standards.
g. You have to process RACs (Reservation Against Cancellation) in a background job. Thus there is a list of supporting back office services that have to run to support the website. You have to identify them and suggest a model (workflow). You have to monitor that these services are working properly and get alerted when they are not.
h. You may have to choose technologies based upon the decisions above. That includes constraints such as your current development staff skills.
i. You need a network model because the application is available across the country to ticketing agents everywhere.
j. How would this system be tested? What are the environments you would use to deploy the system and what are the rules of deployment?
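
To make point (b) concrete, here is a minimal sketch of caching the list of trains between two stations. The `TTLCache` class, the `fetch_trains_from_db` function and the 300-second lifetime are all assumptions made up for illustration; a real system would more likely use a shared cache so that every web server sees the same entries.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader() on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


def fetch_trains_from_db(origin, destination):
    # Hypothetical stand-in for the real (expensive) database query.
    return [f"{origin}-{destination} Express"]


# Assumed lifetime: train lists change rarely, so five minutes is a guess.
train_cache = TTLCache(ttl_seconds=300)


def trains_between(origin, destination):
    key = (origin, destination)
    return train_cache.get_or_load(
        key, lambda: fetch_trains_from_db(origin, destination)
    )
```

Seat availability changes with every booking, so it would need a much shorter lifetime than the train list, or no caching at all.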

Done? Not yet.

The biggest issue you would face on this project is the rollout plan. One fine morning the application goes live and availability information should be available online for that day and going forward. This means there is a data migration from paper based systems (or a legacy digital system) to the new online system. How does that happen? It is within the architectural plan to come up with a solution for this problem.

Obviously the solution path is complex and you have to handle different views of the system. But the above information is what you will put together in a technical specification document.
What is the level of detail you go to? In High Level Design (HLD) documents (or architectural specifications) you do not need class diagrams unless essential to explain a solution.
You can
- describe a problem,
- suggest your approach,
- list alternative approaches,
- give the pros and cons of each approach,
and be done with it. This has to be repeated for each "area of concern" as shown above.
In the Low Level Design (LLD) documents (or technical specifications) you would focus on detailed design for each module including UML diagrams.

I have mostly worked in services organizations in which there is a client technical stakeholder - so I usually use the HLD as a dialog between me and the client; and the LLD as a dialog between me and the team. The HLD and the functional spec give a good overview of the project to a development team.


Tips on the Document Creation

i. I always organize tech specs in a conversational style and freely use the terms "we", "I" or "you" - this sounds informal, but helps the narrative.

ii. I do not hesitate to put questions and possible alternatives in the document. It helps later when someone else needs to understand your thought process. I highlight unanswered questions in different versions using a yellow background.

iii. I create a template structure with the following format:
1. Business overview
2. Scope and out of scope list for the project
3. Overview of the document - with a list of key items that will be addressed
4. References - such as functional specs
5. Development Environment, link to coding conventions, tools, IDE, version of frameworks, source control, language to be used
6. Architecture Overview - standard diagram of the system so that people can understand what we are trying to achieve. You can include the existing system architecture diagram in the case of reengineering projects.
7,8,... From here on you address specific concern areas. Each should be in a separate section.
I usually create the above template upfront with the concern areas also listed. Then I start drilling deeper.

iv. I do not create diagrams while writing the spec. I mention the diagrams in the document, but I draw them later - so that it does not interfere with my thought process. Usually people take time drawing using Visio or other tools. If you really want to draw it out to get a good idea, draw it on paper or draw it on a white board and photograph it. It does help to have a set of Visio templates available.

v. I include the following sections in every HLD document:
- Security
- Performance
- Unit Testing
- Health Monitoring
- Rollout Plan


Saturday, August 09, 2008

MySQL High Availability

I recently had the opportunity to work on a solution for MySQL High Availability. Since I had no idea how MySQL worked or what High Availability meant, it was a week-long crash course. At the end of the week I had a good idea about the “primitives” in database HA solutions. They were surprisingly simple to understand.
High Availability solutions are concerned with database services being up most of the time (for example, the “five nines” SLA, under which databases must be available 99.999% of the time). Although the specific solutions vary across different vendors (such as Microsoft SQL Server, Oracle and MySQL), there is a common set of concepts cutting across vendors.
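
To get a feel for what such an SLA allows, here is a small worked calculation. It assumes a 365-day year and a function name (`allowed_downtime_minutes`) made up for this example; at 99.999% the downtime budget comes to roughly 5.26 minutes per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common availability targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why a failover that waits for a person to react fits poorly with a five nines target.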
Any database solution is a service that runs different threads servicing connection requests. This service runs on a physical or virtual server. The service accesses data from data storage. Usually, the data storage is in the local file system. But it can also be part of a Storage Area Network (SAN). SANs are used in large enterprises and allow network devices to be used as local file systems. Either way, the service need not be aware of this distinction. As far as the service is concerned, the data storage is local.
A normal production system that does not have high availability may use a single database server running the service, which accesses the local file system. If the service fails, someone needs to be notified. The System Administrator (SA) would then restart the service and check what went wrong. This means that for a variable amount of time, the database is down and so is the application.

MySQL Replication

For high availability, you need at least two servers or “nodes”. You can have more than two, but two is the minimum. If one server or service fails, the other takes over.
This seems simple – once a server goes down, the SA can have the second server take over.
But there are two problems with this scenario:
- The data in the first server needs to be available in the second. If it is not, the second server has older data and in a commercial system, that is unacceptable.
- The application is still pointing to the first server. The second server needs to take over the IP address of the first, so that the application can continue to function. This is called IP failover (automatic IP failover when it happens without manual intervention).

To solve the first problem, MySQL uses the Master/Slave configuration, or MySQL Replication. When MySQL is configured, you set up one server as the Master and the other as a Slave (there can be multiple Slaves). The Master records changes in its binary log, and each Slave reads and applies them asynchronously. Thus, when the Master fails, you have most of the data in the Slave. There may be a small delta missing after the last synchronization, but that is usually acceptable, depending on how far the Slave lags behind the Master.
Note that MySQL handles this replication, not a separate service.
So, at this point, the SA gets notified when the Master fails. The SA promotes the Slave to Master and manually switches the IP of the Slave. The application can now start working again while the SA fixes the old Master.

Note: With Master/Slave, in websites which have mostly read-only access, you can have the Slave handle most of the reads, while the Master handles the reads and writes. This requires an application level change (like changing connection strings in PHP or .NET). Thus there is a level of load-balancing you can achieve.
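
As a rough sketch of that application-level change, the router below picks a server per statement. The `ReplicatedRouter` class and the host names are hypothetical, and deciding by the first keyword of the SQL is a simplification; it only illustrates the idea of sending writes to the Master and spreading reads over the Slaves.

```python
import itertools


class ReplicatedRouter:
    """Chooses a server per statement: writes -> master, reads -> slaves."""

    # Statements starting with these keywords are treated as read-only.
    READ_PREFIXES = ("select", "show")

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._read_targets = itertools.cycle(slaves or [master])

    def target_for(self, sql):
        is_read = sql.lstrip().lower().startswith(self.READ_PREFIXES)
        return next(self._read_targets) if is_read else self.master


# Hypothetical host names, for illustration only.
router = ReplicatedRouter(master="db-master:3306", slaves=["db-slave-1:3306"])
print(router.target_for("SELECT * FROM trains"))
print(router.target_for("UPDATE seats SET booked = 1 WHERE id = 7"))
```

Because the Slave can lag behind the Master, a read that must see a write made a moment ago should still go to the Master.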

General parameters for HA

Thus, you can see that running a High Availability solution needs you to understand four important parameters. Ask yourselves these four questions when looking at a clustering solution from any database vendor:
1. How do I get alerted when a service goes down? Most data centers have service-independent alerting systems that would page a system administrator. There is a Linux service called Heartbeat that can monitor the operation of services on Linux.
2. Does the recovery or fail-over need manual intervention? If not, what is the time interval for which the service is broken? As you can see in Master/Slave, the answer is yes, we do require manual intervention so that an SA can configure the Slave to be the new Master.
3. Is the data available completely after failure recovery? In Master/Slave, we already saw that there is a short loss of data. But this can be resynchronized manually using transaction logs in the Master.
4. How many nodes (servers) do I need to run an HA system? This corresponds to the infrastructure cost. Master/Slave can run with two nodes, while MySQL Clustering needs as many as 5 nodes to run an HA system. Of course this is a tradeoff, because the system recovery in a MySQL cluster does not require manual intervention and can happen in less than 3 seconds in case of failure.
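
For the first question, the simplest possible alert is a periodic check that the database port still accepts connections. The sketch below is an assumption-laden illustration, not how Heartbeat works: the function names are made up, and the `alert` callback stands in for whatever paging system your data center uses.

```python
import socket


def is_service_up(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_and_alert(host, port, alert):
    """Call alert(message) if the service is down; return the service status."""
    if is_service_up(host, port):
        return True
    alert(f"Database service at {host}:{port} is not accepting connections")
    return False
```

You would run something like `check_and_alert("127.0.0.1", 3306, print)` from a scheduler every few seconds. An open port only shows that the service is listening, so a stricter check would also run a trivial query.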

There are other options for HA in MySQL such as MySQL+Heartbeat+DRBD and MySQL Clustering. I encourage you to explore these and apply the above four questions to each solution. In fact, you can use these for judging SQL Server and Oracle clustering too.


Thursday, August 07, 2008

Insight of the day

Most of the professionals in this industry spend time proving to others that they are smart. That they can understand requirements or technology without asking too many questions or without showing ignorance. I think many projects would actually be more successful if we were just allowed to say we don't know jack about what the client is talking about. The same holds across all ranks - if developers could tell project leads that they had no idea what was expected of them, it would actually move projects faster - I am positive.
