This page describes how data is collected and processed. Because of the nature of the data that is made available, various problems in data collection and interpretation, as well as some arbitrary choices in what data to incorporate and how to aggregate it when multiple links are reported, the figures on the MINTS pages should not be used to conclude anything definite about any individual network or exchange. Only overall trends consistent across many sites should be noted.
A good example of the limitations of our analysis is exhibited by SWITCH, the Swiss academic and research network. The statistics for the external links that we chose to monitor show rather irregular growth rates between the end of 2001 and mid-2007. Yet a more comprehensive measure, namely the total volume of data traffic leaving SWITCHlan (see SWITCH notes), shows a much more regular annual growth rate of 55% during this period (although it appears to consist of pairs of years, one with slow growth and the second with a near doubling of traffic).
Sites that make traffic statistics publicly available are definitely not a representative sample of all those on the Internet. They tend to be Internet exchanges and academic and research networks, with only a smattering of private companies or commercial service providers. Hence great care should be taken in drawing any conclusions about the overall state of the Internet from the analyses and data on this site. In particular, the growth rates shown here are dominated by those of Internet exchanges, whose traffic appears to be growing faster than that of the entire Internet, as more carriers decide to exchange traffic at such places to minimize costs.
Data is collected from publicly available sources. Most comes in the form of MRTG or RRD graphs. Logfiles are also collected from sites that make them available. All of the information in the graphs and figures is collected from the public domain. MRTG graphs typically show transmission rates in bits/second, which is the same unit used in the traffic graphs on this page. For some sites with many links, a single traffic value is reported. When such a site does not report an aggregate traffic rate, this project obtains such a value by summing values for individual links.
Data is drawn from sources in order of resolution. In particular, log files are considered to have the best resolution, followed by daily, weekly, monthly, and yearly graphs. Sources with values that are suspicious for any reason are ignored.
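The ordering rule above can be sketched in a few lines of Python. This is only an illustration, not the project's own code: the source labels, the `pick_value` helper, and the default "suspicious" test (missing or negative readings) are all assumptions made for the example.

```python
# Resolution ranking taken from the text: log files first, yearly graphs last.
PRIORITY = ["log", "daily", "weekly", "monthly", "yearly"]


def pick_value(candidates, is_suspicious=lambda v: v is None or v < 0):
    """Return (source, value) from the highest-resolution usable source.

    `candidates` maps a source label to its reading in bits/second.
    The default `is_suspicious` check is a placeholder assumption; the
    text only says that suspicious values are ignored, not how they
    are detected. Returns None when no usable source remains.
    """
    for source in PRIORITY:
        if source in candidates and not is_suspicious(candidates[source]):
            return source, candidates[source]
    return None


# The daily graph outranks the weekly one, so its value is used.
choice = pick_value({"weekly": 5.0e6, "daily": 4.8e6})
# A suspicious log-file reading is skipped in favour of the monthly graph.
fallback = pick_value({"log": -1.0, "monthly": 3.0e6})
```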
Growth rates are obtained by performing regression analysis on the data. We look for an exponential fit of the form $y = a \cdot 10^{bx}$, where x is the day and y is measured in bits/second.
We solve for this by first computing a least-squares linear fit to the logarithm of the data, $\log y = \log a + b x$. All logarithms are base ten.
The annual growth rate (AGR) is the multiplicative increase in the volume of traffic over a year as given by the exponential curve, or $\mathrm{AGR} = 10^{365 b}$, since x is measured in days.
Thus AGR = 2 means that (on the smooth curve produced by the prescribed procedure) the volume of traffic doubles in a year, or grows 100%. AGR = 5 corresponds to traffic growing five-fold, or 400% per year.
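The fitting procedure can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the project's Java code: the function names are invented here, a year is taken to be 365 days, and the input data is synthetic.

```python
import math


def fit_exponential(days, rates):
    """Least-squares fit of log10(rate) = log10(a) + b * day; returns (a, b)."""
    n = len(days)
    logs = [math.log10(r) for r in rates]
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx                      # slope of the fitted line
    a = 10 ** (mean_y - b * mean_x)    # intercept, converted back from log10
    return a, b


def annual_growth_rate(b, days_per_year=365):
    """Multiplicative change over one year; 365 days/year is an assumption."""
    return 10 ** (days_per_year * b)


# Synthetic data: traffic starting at 1 Mb/s that doubles every 365 days.
days = list(range(0, 730, 30))
rates = [1e6 * 2 ** (d / 365) for d in days]
a, b = fit_exponential(days, rates)
agr = annual_growth_rate(b)   # close to 2, i.e. 100% growth per year
```

Because the synthetic series is exactly exponential, the fit recovers a starting rate near 1 Mb/s and an AGR near 2. Real data scatters around the fitted line.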
Annual Growth Rate Graphs
The y-axis in the AGR graph measures the change in the incoming traffic of a site. Thus, 1 means no change over the time period, 2 means 100% growth, 3 means 200% growth, etc. Similarly, 1/2 means a 50% reduction, 1/4 means a 75% reduction, and so on. The vertical axis has a log base 2 scale. This scale is used to compress very fast growth rates, and also to give more prominence to large declines.
The x-coordinate of a point in the graph is the 'representative' traffic of the site over the interval. This is obtained by taking the exponential fit determined above and evaluating it at the midpoint of the days monitored over the interval. If a site was not monitored during the entire timescale, because it became defunct or for any other reason, the midpoint of its life is used. Please note that the x-axis is also logarithmic. If a site happens to have differing values for incoming and outgoing traffic, the larger value is used.
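As a rough illustration of the representative-traffic calculation, the sketch below evaluates a fitted curve at the midpoint of the monitored period. The function names are hypothetical, and the midpoint is taken as the average of the first and last monitored day, which is one reading of the description above.

```python
import math


def representative_traffic(a, b, first_day, last_day):
    """Evaluate the fit y = a * 10**(b * x) at the midpoint of the monitored days."""
    midpoint = (first_day + last_day) / 2
    return a * 10 ** (b * midpoint)


def site_x_value(fit_in, fit_out, first_day, last_day):
    """Larger of the incoming and outgoing representative traffic.

    `fit_in` and `fit_out` are (a, b) pairs from the exponential fit.
    """
    return max(
        representative_traffic(*fit_in, first_day, last_day),
        representative_traffic(*fit_out, first_day, last_day),
    )


# A site that starts at 1 Mb/s inbound and doubles yearly, monitored for 730 days.
b_double = math.log10(2) / 365
rep_in = representative_traffic(1e6, b_double, 0, 730)   # about 2 Mb/s at day 365
# Outbound traffic is flat at 0.5 Mb/s, so the inbound value is the one plotted.
x_value = site_x_value((1e6, b_double), (0.5e6, 0.0), 0, 730)
```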
The weighted average AGR is obtained from sites with growth rates g1 ... gk with representative traffic t1 ... tk by computing:
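The expression itself does not appear in this text. One natural reading, shown here purely as an assumption, is the traffic-weighted arithmetic mean, (g1 t1 + ... + gk tk) / (t1 + ... + tk). A traffic-weighted geometric mean would be another plausible reading and would give somewhat different numbers. The function name below is hypothetical.

```python
def weighted_average_agr(growth_rates, traffic):
    """Traffic-weighted arithmetic mean of per-site growth rates.

    ASSUMPTION: the weighting scheme is a guess at the intended formula,
    not taken from the text.
    """
    if len(growth_rates) != len(traffic):
        raise ValueError("need one traffic value per growth rate")
    total = sum(traffic)
    return sum(g * t for g, t in zip(growth_rates, traffic)) / total


# Two sites: a large one doubling yearly and a smaller one growing 50% per year.
example = weighted_average_agr([2.0, 1.5], [3e9, 1e9])   # (6e9 + 1.5e9) / 4e9
```

Under this reading, sites with more representative traffic pull the average toward their own growth rate.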
The MINTS project uses two main programs. One periodically collects traffic graphs, the other obtains numerical values for traffic rates from those graphs.
The publicly available graphs are all derived from databases that are more accurate than the data one can hope to derive from the graphs. However, aside from the sites that make the underlying datasets publicly available, obtaining that data requires contacting appropriate people in each organization and arranging for continuous data feeds. Such a labor-intensive and error-prone approach does not scale. Hence automated decoding of the graphs is used. This is still a laborious process, since the large variety of formats produced by MRTG and RRD tools alone leads to substantial challenges in decoding the graphs properly. However, the configuration files that allow automated decoding of the graphs only have to be set up once for each site, and updated on the infrequent occasions when the output format, or the topology of the network, changes.
The main programs used in the project are written in Java. Graphs are downloaded twice a day, and the data obtained from them is stored in a database. The result is that in many cases there is more data available at MINTS than is held by the sites being monitored (since the MRTG and RRD programs typically keep statistics for only 18 months or so).
A variety of problems arise in data collection. They are easily seen by inspecting a random sample of the traffic pages for which URLs are provided in the Data report section of MINTS. Quite frequently sites cease updating their graphs due to intermittent problems, maintenance, or changes in network architecture. Additionally, the collection software of this project has at times experienced problems, resulting in the loss of data, although in most cases the gaps could be filled in later.
Sometimes a site will record a huge but spurious traffic volume as a result of some fault. When this happens, the graphs will show a few extremely high and spurious values, with the correct values squeezed into the bottom part of the graph, where the precise values cannot be resolved.
Another problem has to do with aggregation. For sites with multiple links, not all links always have data recorded at every point in time. This can result in massive jumps on the aggregated traffic plots that may not reflect significant changes in actual traffic rates.
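The aggregation artifact is easy to reproduce with made-up numbers. In the sketch below, the link names, the sample values, and both helper functions are hypothetical. The "strict" variant, which refuses to report a total when a link is missing, is shown only as one possible mitigation and is not a description of what the project's software does.

```python
def naive_aggregate(samples):
    """Sum whatever links happen to be present at each timestamp."""
    return {t: sum(readings.values()) for t, readings in sorted(samples.items())}


def strict_aggregate(samples, links):
    """Sum per-link rates, reporting None when any expected link is missing."""
    totals = {}
    for t, readings in sorted(samples.items()):
        if all(link in readings for link in links):
            totals[t] = sum(readings[link] for link in links)
        else:
            totals[t] = None   # flag the gap instead of reporting a partial sum
    return totals


# Made-up readings in bits/second; link_b has no data at timestamp 2.
samples = {
    1: {"link_a": 400e6, "link_b": 600e6},
    2: {"link_a": 410e6},
    3: {"link_a": 420e6, "link_b": 610e6},
}
naive = naive_aggregate(samples)                          # total drops sharply at t=2
strict = strict_aggregate(samples, ["link_a", "link_b"])  # t=2 is marked as a gap
```

In the naive totals the aggregate falls from 1 Gb/s to 410 Mb/s and back, even though traffic on the one link that was recorded barely changed.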
The nature of the graphs themselves contributes another source of error. A large abnormal spike of traffic can flatten the rest of the curve and leave essentially unusable data. The order of drawing can also play an important role. In general, for MRTG graphs the incoming lines are drawn first, followed by the outgoing. As a result, even if the values are an exact match, rendering the outgoing line over the incoming line may introduce an additional pixel of error, which can be quite significant, depending on the scale of the graph.
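To give a feel for how much one pixel can matter, the following sketch computes the traffic represented by a single pixel and the resulting relative error. It assumes a linear y-axis and a hypothetical plot height of 100 pixels. Both are assumptions for the example, as are the traffic figures and function names.

```python
def bits_per_pixel(y_axis_max, plot_height_px):
    """Traffic (bits/second) represented by one vertical pixel on a linear axis."""
    return y_axis_max / plot_height_px


def relative_pixel_error(true_rate, y_axis_max, plot_height_px, pixels_off=1):
    """Error from misreading a curve by `pixels_off` pixels, relative to the true rate."""
    return pixels_off * bits_per_pixel(y_axis_max, plot_height_px) / true_rate


# Normal case: 50 Mb/s of traffic on a graph scaled to 100 Mb/s.
normal_error = relative_pixel_error(50e6, 100e6, 100)   # about 0.02, i.e. 2%
# A spurious 10 Gb/s spike rescales the same graph, so one pixel spans 100 Mb/s.
spike_error = relative_pixel_error(50e6, 10e9, 100)     # about 2.0, i.e. 200%
```

With the spike present, the 50 Mb/s signal occupies less than one pixel, which is why the correct values cannot be resolved from such a graph.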