Archive for August, 2007

How do I limit the number of simultaneous web connections from a client system via the Squid proxy server?

 

Using Squid’s maxconn Access Control List (acl) element it is possible to limit simultaneous web connections coming from client systems that use a Squid proxy server. To use the maxconn element, the “client_db on” directive should be set. By default this is activated in the /etc/squid/squid.conf configuration file.
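The directive itself is a single line; as a quick sanity check, make sure it has not been switched off in squid.conf:

# must be on for maxconn tracking (this is the default)
client_db on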

Edit the /etc/squid/squid.conf file and look for the section shown below, so you can put the maxconn parameters right after it:

# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS

# Example rule allowing access from your local networks. Adapt
# to list your (internal) IP networks from where browsing should
# be allowed
#acl our_networks src 192.168.1.0/24 192.168.2.0/24
#http_access allow our_networks

Here is an example that limits each client system on a network to 5 simultaneous web connections:

acl STUDENTS src 192.168.3.0/24
acl numbercon maxconn 5
http_access deny STUDENTS numbercon

Restart the Squid service for the changes to take effect:

service squid restart

http://www.redhat.com/magazine/011sep05/departments/tips_tricks/

     

    Thirteen Simple Rules for Speeding Up Your Web Site

     


    High Performance Web Sites
    by Steve Souders,
    Chief Performance Yahoo!
    available in September from
    O’Reilly Media

    1. Make Fewer HTTP Requests
    2. Use a Content Delivery Network
    3. Add an Expires Header
    4. Gzip Components
    5. Put CSS at the Top
    6. Move Scripts to the Bottom
    7. Avoid CSS Expressions
    8. Make JavaScript and CSS External
    9. Reduce DNS Lookups
    10. Minify JavaScript
    11. Avoid Redirects
    12. Remove Duplicate Scripts
    13. Configure ETags

    1: Minimize HTTP Requests

    80% of the end-user response time is spent on the front-end. Most of this time is tied up in downloading all the components in the page: images, stylesheets, scripts, Flash, etc. Reducing the number of components in turn reduces the number of HTTP requests required to render the page. This is the key to faster pages.

    One way to reduce the number of components in the page is to simplify the page’s design. But is there a way to build pages with richer content while also achieving fast response times? Here are some techniques for reducing the number of HTTP requests, while still supporting rich page designs.

    Image maps combine multiple images into a single image. The overall size is about the same, but reducing the number of HTTP requests speeds up the page. Image maps only work if the images are contiguous in the page, such as a navigation bar. Defining the coordinates of image maps can be tedious and error prone.

    CSS Sprites are the preferred method for reducing the number of image requests. Combine all the images in your page into a single image and use the CSS background-image and background-position properties to display the desired image segment.
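    As a rough sketch (the file name, sizes, and offsets here are made-up placeholders), two icons served from one combined image might be styled like this:

          /* nav-sprite.gif is a hypothetical combined image holding both icons */
          .icon-home { background-image: url('nav-sprite.gif'); background-position: 0 0;     width: 16px; height: 16px; }
          .icon-mail { background-image: url('nav-sprite.gif'); background-position: 0 -16px; width: 16px; height: 16px; }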

    Inline images use the data: URL scheme to embed the image data in the actual page. This can increase the size of your HTML document. Combining inline images into your (cached) stylesheets is a way to reduce HTTP requests and avoid increasing the size of your pages.
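    For illustration only (the base64 payload is truncated here), an inline image in a cached stylesheet looks like this:

          .logo { background-image: url("data:image/png;base64,iVBORw0KGgoAAAANSUhEUg..."); }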

    Combined files are a way to reduce the number of HTTP requests by combining all scripts into a single script, and similarly combining all stylesheets into a single stylesheet. It’s a simple idea that hasn’t seen wide adoption. The ten top U.S. web sites average 7 scripts and 2 stylesheets per page. Combining files is more challenging when the scripts and stylesheets vary from page to page, but making this part of your release process improves response times.

    Reducing the number of HTTP requests in your page is the place to start. This is the most important guideline for improving performance for first time visitors. As described in Tenni Theurer’s blog Browser Cache Usage – Exposed!, 40-60% of daily visitors to your site come in with an empty cache. Making your page fast for these first time visitors is key to a better user experience.

    2: Use a Content Delivery Network

    The user’s proximity to your web server has an impact on response times. Deploying your content across multiple, geographically dispersed servers will make your pages load faster from the user’s perspective. But where should you start?

    As a first step to implementing geographically dispersed content, don’t attempt to redesign your web application to work in a distributed architecture. Depending on the application, changing the architecture could include daunting tasks such as synchronizing session state and replicating database transactions across server locations. Attempts to reduce the distance between users and your content could be delayed by, or never pass, this application architecture step.

    Remember that 80-90% of the end-user response time is spent downloading all the components in the page: images, stylesheets, scripts, Flash, etc. This is the Performance Golden Rule, as explained in The Importance of Front-End Performance. Rather than starting with the difficult task of redesigning your application architecture, it’s better to first disperse your static content. This not only achieves a bigger reduction in response times, but it’s easier thanks to content delivery networks.

    A content delivery network (CDN) is a collection of web servers distributed across multiple locations to deliver content more efficiently to users. The server selected for delivering content to a specific user is typically based on a measure of network proximity. For example, the server with the fewest network hops or the server with the quickest response time is chosen.

    Some large Internet companies own their own CDN, but it’s cost-effective to use a CDN service provider, such as Akamai Technologies, Mirror Image Internet, or Limelight Networks. For start-up companies and private web sites, the cost of a CDN service can be prohibitive, but as your target audience grows larger and becomes more global, a CDN is necessary to achieve fast response times. At Yahoo!, properties that moved static content off their application web servers to a CDN improved end-user response times by 20% or more. Switching to a CDN is a relatively easy code change that will dramatically improve the speed of your web site.

    3: Add an Expires Header

    Web page designs are getting richer and richer, which means more scripts, stylesheets, images, and Flash in the page. A first-time visitor to your page may have to make several HTTP requests, but by using the Expires header you make those components cacheable. This avoids unnecessary HTTP requests on subsequent page views. Expires headers are most often used with images, but they should be used on all components including scripts, stylesheets, and Flash components.

    Browsers (and proxies) use a cache to reduce the number and size of HTTP requests, making web pages load faster. A web server uses the Expires header in the HTTP response to tell the client how long a component can be cached. The example below is a far future Expires header, telling the browser that this response won’t be stale until April 15, 2010.

          Expires: Thu, 15 Apr 2010 20:00:00 GMT

    If your server is Apache, use the ExpiresDefault directive to set an expiration date relative to the current date. This example of the ExpiresDefault directive sets the Expires date 10 years out from the time of the request.

          ExpiresDefault "access plus 10 years"
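    ExpiresDefault is provided by Apache’s mod_expires module, so the module must be loaded and expirations enabled first; a minimal sketch of the relevant configuration lines (the per-type entry is just an illustration) is:

          ExpiresActive On
          ExpiresDefault "access plus 10 years"
          ExpiresByType image/gif "access plus 10 years"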

    Keep in mind, if you use a far future Expires header you have to change the component’s filename whenever the component changes. At Yahoo! we often make this step part of the build process: a version number is embedded in the component’s filename, for example, yahoo_2.0.6.js.

    Using a far future Expires header affects page views only after a user has already visited your site. It has no effect on the number of HTTP requests when a user visits your site for the first time and the browser’s cache is empty. The impact of this performance improvement depends, therefore, on how often users hit your pages with a primed cache. (A “primed cache” already contains all of the components in the page.) We measured this at Yahoo! and found the number of page views with a primed cache is 75-85%. By using a far future Expires header, you increase the number of components that are cached by the browser and re-used on subsequent page views without sending a single byte over the user’s Internet connection.

    4: Gzip Components

    The time it takes to transfer an HTTP request and response across the network can be significantly reduced by decisions made by front-end engineers. It’s true that the end-user’s bandwidth speed, Internet service provider, proximity to peering exchange points, etc. are beyond the control of the development team. But there are other variables that affect response times. Compression reduces response times by reducing the size of the HTTP response.

    Starting with HTTP/1.1, web clients indicate support for compression with the Accept-Encoding header in the HTTP request.

          Accept-Encoding: gzip, deflate

    If the web server sees this header in the request, it may compress the response using one of the methods listed by the client. The web server notifies the web client of this via the Content-Encoding header in the response.

          Content-Encoding: gzip

    Gzip is the most popular and effective compression method at this time. It was developed by the GNU project and standardized by RFC 1952. The only other compression format you’re likely to see is deflate, but it’s less effective and less popular.

    Gzipping generally reduces the response size by about 70%. Approximately 90% of today’s Internet traffic travels through browsers that claim to support gzip. If you use Apache, the module configuring gzip depends on your version: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.
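    As a hedged example for Apache 2.x with mod_deflate loaded (the MIME types listed are common choices, not a definitive set), a single line enables compression for HTML, CSS, and JavaScript:

          AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript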

    There are known issues with browsers and proxies that may cause a mismatch in what the browser expects and what it receives with regard to compressed content. Fortunately, these edge cases are dwindling as the use of older browsers drops off. The Apache modules help out by adding appropriate Vary response headers automatically.

    Servers choose what to gzip based on file type, but are typically too limited in what they decide to compress. Most web sites gzip their HTML documents. It’s also worthwhile to gzip your scripts and stylesheets, but many web sites miss this opportunity. In fact, it’s worthwhile to compress any text response including XML and JSON. Image and PDF files should not be gzipped because they are already compressed. Trying to gzip them not only wastes CPU but can potentially increase file sizes.

    Gzipping as many file types as possible is an easy way to reduce page weight and accelerate the user experience.

    5: Put CSS at the Top

    While researching performance at Yahoo!, we discovered that moving stylesheets to the document HEAD makes pages load faster. This is because putting stylesheets in the HEAD allows the page to render progressively.

    Front-end engineers that care about performance want a page to load progressively; that is, we want the browser to display whatever content it has as soon as possible. This is especially important for pages with a lot of content and for users on slower Internet connections. The importance of giving users visual feedback, such as progress indicators, has been well researched and documented. In our case the HTML page is the progress indicator! When the browser loads the page progressively the header, the navigation bar, the logo at the top, etc. all serve as visual feedback for the user who is waiting for the page. This improves the overall user experience.

    The problem with putting stylesheets near the bottom of the document is that it prohibits progressive rendering in many browsers, including Internet Explorer. Browsers block rendering to avoid having to redraw elements of the page if their styles change. The user is stuck viewing a blank white page. Firefox doesn’t block rendering, which means when the stylesheet is done loading it’s possible elements in the page will have to be redrawn, resulting in the flash of unstyled content problem.

    The HTML specification clearly states that stylesheets are to be included in the HEAD of the page: “Unlike A, [LINK] may only appear in the HEAD section of a document, although it may appear any number of times.” Neither of the alternatives, the blank white screen or flash of unstyled content, are worth the risk. The optimal solution is to follow the HTML specification and load your stylesheets in the document HEAD.
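    In practice that means referencing stylesheets with LINK tags inside the HEAD (styles.css below is a placeholder name):

          <head>
            <link rel="stylesheet" type="text/css" href="styles.css">
          </head>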

    6: Move Scripts to the Bottom

    Rule 5 described how stylesheets near the bottom of the page prohibit progressive rendering, and how moving them to the document HEAD eliminates the problem. Scripts (external JavaScript files) pose a similar problem, but the solution is just the opposite: it’s better to move scripts from the top to as low in the page as possible. One reason is to enable progressive rendering, but another is to achieve greater download parallelization.

    With stylesheets, progressive rendering is blocked until all stylesheets have been downloaded. That’s why it’s best to move stylesheets to the document HEAD, so they get downloaded first and rendering isn’t blocked. With scripts, progressive rendering is blocked for all content below the script. Moving scripts as low in the page as possible means there’s more content above the script that is rendered sooner.

    The second problem caused by scripts is blocking parallel downloads. The HTTP/1.1 specification suggests that browsers download no more than two components in parallel per hostname. If you serve your images from multiple hostnames, you can get more than two downloads to occur in parallel. (I’ve gotten Internet Explorer to download over 100 images in parallel.) While a script is downloading, however, the browser won’t start any other downloads, even on different hostnames.

    In some situations it’s not easy to move scripts to the bottom. If, for example, the script uses document.write to insert part of the page’s content, it can’t be moved lower in the page. There might also be scoping issues. In many cases, there are ways to work around these situations.

    An alternative suggestion that often comes up is to use deferred scripts. The DEFER attribute indicates that the script does not contain document.write, and is a clue to browsers that they can continue rendering. Unfortunately, Firefox doesn’t support the DEFER attribute. In Internet Explorer, the script may be deferred, but not as much as desired. If a script can be deferred, it can also be moved to the bottom of the page. That will make your web pages load faster.
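    For reference, a deferred external script is written like this (menu.js is a placeholder), keeping in mind the browser-support caveats noted above:

          <script defer type="text/javascript" src="menu.js"></script>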

    7: Avoid CSS Expressions

    CSS expressions are a powerful (and dangerous) way to set CSS properties dynamically. They’re supported in Internet Explorer, starting with version 5. As an example, the background color could be set to alternate every hour using CSS expressions.

          background-color: expression( (new Date()).getHours()%2 ? "#B8D4FF" : "#F08A00" );

    As shown here, the expression method accepts a JavaScript expression. The CSS property is set to the result of evaluating the JavaScript expression. The expression method is ignored by other browsers, so it is useful for setting properties in Internet Explorer needed to create a consistent experience across browsers.

    The problem with expressions is that they are evaluated more frequently than most people expect. Not only are they evaluated when the page is rendered and resized, but also when the page is scrolled and even when the user moves the mouse over the page. Adding a counter to the CSS expression allows us to keep track of when and how often a CSS expression is evaluated. Moving the mouse around the page can easily generate more than 10,000 evaluations.

    One way to reduce the number of times your CSS expression is evaluated is to use one-time expressions, where the first time the expression is evaluated it sets the style property to an explicit value, which replaces the CSS expression. If the style property must be set dynamically throughout the life of the page, using event handlers instead of CSS expressions is an alternative approach. If you must use CSS expressions, remember that they may be evaluated thousands of times and could affect the performance of your page.
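    Here is a sketch of the one-time expression idea (Internet Explorer only; the color values simply mirror the earlier example). The JavaScript function assigns the style property directly, which overwrites the expression so it is not evaluated again:

          <style>
          p { background-color: expression( altBgcolor(this) ); }
          </style>
          <script type="text/javascript">
          function altBgcolor(elem) {
              // assigning the property directly replaces the CSS expression for this element
              elem.style.backgroundColor = (new Date()).getHours()%2 ? "#B8D4FF" : "#F08A00";
          }
          </script>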

    8: Make JavaScript and CSS External

    Many of these performance rules deal with how external components are managed. However, before these considerations arise you should ask a more basic question: Should JavaScript and CSS be contained in external files, or inlined in the page itself?

    Using external files in the real world generally produces faster pages because the JavaScript and CSS files are cached by the browser. JavaScript and CSS that are inlined in HTML documents get downloaded every time the HTML document is requested. This reduces the number of HTTP requests that are needed, but increases the size of the HTML document. On the other hand, if the JavaScript and CSS are in external files cached by the browser, the size of the HTML document is reduced without increasing the number of HTTP requests.

    The key factor, then, is the frequency with which external JavaScript and CSS components are cached relative to the number of HTML documents requested. This factor, although difficult to quantify, can be gauged using various metrics. If users on your site have multiple page views per session and many of your pages re-use the same scripts and stylesheets, there is a greater potential benefit from cached external files.

    Many web sites fall in the middle of these metrics. For these properties, the best solution generally is to deploy the JavaScript and CSS as external files. The only exception I’ve seen where inlining is preferable is with home pages, such as Yahoo!’s front page (http://www.yahoo.com) and My Yahoo! (http://my.yahoo.com). Home pages that have few (perhaps only one) page view per session may find that inlining JavaScript and CSS results in faster end-user response times.

    For front pages that are typically the first of many page views, there are techniques that leverage the reduction of HTTP requests that inlining provides, as well as the caching benefits achieved through using external files. One such technique is to inline JavaScript and CSS in the front page, but dynamically download the external files after the page has finished loading. Subsequent pages would reference the external files that should already be in the browser’s cache.
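    One hedged sketch of that post-onload download (the file name is a placeholder) is to create the script element once the page has finished loading:

          <script type="text/javascript">
          window.onload = function() {
              // fetch the external file after onload so it sits in the cache for later page views
              var js = document.createElement("script");
              js.src = "menu_1.0.17.js";  // placeholder file name
              document.getElementsByTagName("head")[0].appendChild(js);
          };
          </script>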

    9: Reduce DNS Lookups

    The Domain Name System (DNS) maps hostnames to IP addresses, just as phonebooks map people’s names to their phone numbers. When you type http://www.yahoo.com into your browser, a DNS resolver contacted by the browser returns that server’s IP address. DNS has a cost. It typically takes 20-120 milliseconds for DNS to lookup the IP address for a given hostname. The browser can’t download anything from this hostname until the DNS lookup is completed.

    DNS lookups are cached for better performance. This caching can occur on a special caching server, maintained by the user’s ISP or local area network, but there is also caching that occurs on the individual user’s computer. The DNS information remains in the operating system’s DNS cache (the “DNS Client service” on Microsoft Windows). Most browsers have their own caches, separate from the operating system’s cache. As long as the browser keeps a DNS record in its own cache, it doesn’t bother the operating system with a request for the record.

    Internet Explorer caches DNS lookups for 30 minutes by default, as specified by the DnsCacheTimeout registry setting. Firefox caches DNS lookups for 1 minute, controlled by the network.dnsCacheExpiration configuration setting. (Fasterfox changes this to 1 hour.)

    When the client’s DNS cache is empty (for both the browser and the operating system), the number of DNS lookups is equal to the number of unique hostnames in the web page. This includes the hostnames used in the page’s URL, images, script files, stylesheets, Flash objects, etc. Reducing the number of unique hostnames reduces the number of DNS lookups.

    Reducing the number of unique hostnames has the potential to reduce the amount of parallel downloading that takes place in the page. Avoiding DNS lookups cuts response times, but reducing parallel downloads may increase response times. My guideline is to split these components across at least two but no more than four hostnames. This results in a good compromise between reducing DNS lookups and allowing a high degree of parallel downloads.
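    As an illustration (the static1/static2 hostnames are hypothetical), splitting components across two hostnames keeps DNS lookups to two while still allowing up to four parallel downloads:

          <img src="http://static1.example.com/images/logo.gif">
          <img src="http://static2.example.com/images/photo.jpg">
          <script type="text/javascript" src="http://static1.example.com/js/menu.js"></script>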

    10: Minify JavaScript

    Minification is the practice of removing unnecessary characters from code to reduce its size, thereby improving load times. When code is minified all comments are removed, as well as unneeded white space characters (space, newline, and tab). In the case of JavaScript, this improves response time performance because the size of the downloaded file is reduced. The most popular tool for minifying JavaScript code is JSMin, developed by Douglas Crockford, a fellow Yahoo!.
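    JSMin is typically used as a command-line filter that reads standard input and writes standard output; a usage sketch (file names are placeholders) would be:

          jsmin < menu.js > menu-min.js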

    Obfuscation is an alternative optimization that can be applied to source code. Like minification, it removes comments and white space, but it also munges the code. As part of munging, function and variable names are converted into smaller strings making the code more compact as well as harder to read. This is typically done to make it more difficult to reverse engineer the code. But munging can help performance because it reduces the code size beyond what is achieved by minification. The tool-of-choice is less clear in the area of JavaScript obfuscation. Dojo Compressor (ShrinkSafe) is the one I’ve seen used the most.

    Minification is a safe, fairly straightforward process. Obfuscation, on the other hand, is more complex and thus more likely to generate bugs as a result of the obfuscation step itself. Obfuscation also requires modifying your code to indicate API functions and other symbols that should not be munged. It also makes it harder to debug your code in production. Although I’ve never seen problems introduced from minification, I have seen bugs caused by obfuscation. In a survey of ten top U.S. web sites, minification achieved a 21% size reduction versus 25% for obfuscation. Although obfuscation has a higher size reduction, I recommend minifying JavaScript code because of the reduced risks and maintenance costs.

    In addition to minifying external scripts, inlined script blocks can and should also be minified. Even if you gzip your scripts, as described in Rule 4, minifying them will still reduce the size by 5% or more. As the use and size of JavaScript increases, so will the savings gained by minifying your JavaScript code.

    11: Avoid Redirects

    Redirects are accomplished using the 301 and 302 status codes. Here’s an example of the HTTP headers in a 301 response:

          HTTP/1.1 301 Moved Permanently
          Location: http://example.com/newuri
          Content-Type: text/html

    The browser automatically takes the user to the URL specified in the Location field. All the information necessary for a redirect is in the headers. The body of the response is typically empty. Despite their names, neither a 301 nor a 302 response is cached in practice unless additional headers, such as Expires or Cache-Control, indicate it should be. The meta refresh tag and JavaScript are other ways to direct users to a different URL, but if you must do a redirect, the preferred technique is to use the standard 3xx HTTP status codes, primarily to ensure the back button works correctly.

    The main thing to remember is that redirects slow down the user experience. Inserting a redirect between the user and the HTML document delays everything in the page since nothing in the page can be rendered and no components can start being downloaded until the HTML document has arrived.

    One of the most wasteful redirects happens frequently and web developers are generally not aware of it. It occurs when a trailing slash (/) is missing from a URL that should otherwise have one. For example, going to http://astrology.yahoo.com/astrology results in a 301 response containing a redirect to http://astrology.yahoo.com/astrology/ (notice the added trailing slash). This is fixed in Apache by using Alias or mod_rewrite, or the DirectorySlash directive if you’re using Apache handlers.

    Connecting an old web site to a new one is another common use for redirects. Others include connecting different parts of a website and directing the user based on certain conditions (type of browser, type of user account, etc.). Using a redirect to connect two web sites is simple and requires little additional coding. Although using redirects in these situations reduces the complexity for developers, it degrades the user experience. Alternatives for this use of redirects include using Alias and mod_rewrite if the two code paths are hosted on the same server. If a domain name change is the cause of using redirects, an alternative is to create a CNAME (a DNS record that creates an alias pointing from one domain name to another) in combination with Alias or mod_rewrite.
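    A CNAME record in a zone file is a one-liner; in this hypothetical example, requests to the old name resolve to the new one, and Alias or mod_rewrite then maps the old paths on the server:

          old.example.com.    IN    CNAME    new.example.com.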

    12: Remove Duplicate Scripts

    It hurts performance to include the same JavaScript file twice in one page. This isn’t as unusual as you might think. A review of the ten top U.S. web sites shows that two of them contain a duplicated script. Two main factors increase the odds of a script being duplicated in a single web page: team size and number of scripts. When it does happen, duplicate scripts hurt performance by creating unnecessary HTTP requests and wasted JavaScript execution.

    Unnecessary HTTP requests happen in Internet Explorer, but not in Firefox. In Internet Explorer, if an external script is included twice and is not cacheable, it generates two HTTP requests during page loading. Even if the script is cacheable, extra HTTP requests occur when the user reloads the page.

    In addition to generating wasteful HTTP requests, time is wasted evaluating the script multiple times. This redundant JavaScript execution happens in both Firefox and Internet Explorer, regardless of whether the script is cacheable.

    One way to avoid accidentally including the same script twice is to implement a script management module in your templating system. The typical way to include a script is to use the SCRIPT tag in your HTML page.

          <script type="text/javascript" src="menu_1.0.17.js"></script>

    An alternative in PHP would be to create a function called insertScript.

          <?php insertScript("menu.js") ?>

    In addition to preventing the same script from being inserted multiple times, this function could handle other issues with scripts, such as dependency checking and adding version numbers to script filenames to support far future Expires headers.
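    A hedged sketch of such an insertScript helper (the duplicate check and versioning map are illustrative, not a definitive implementation) might look like this:

          <?php
          function insertScript($file) {
              // remember which scripts have already been written to this page
              static $inserted = array();
              if (isset($inserted[$file])) {
                  return;                                  // skip the duplicate
              }
              $inserted[$file] = true;

              // append a version number so a far future Expires header can be used
              $versions = array("menu.js" => "1.0.17");    // illustrative mapping
              if (isset($versions[$file])) {
                  $file = preg_replace('/\.js$/', '_' . $versions[$file] . '.js', $file);
              }
              echo '<script type="text/javascript" src="' . $file . '"></script>' . "\n";
          }
          ?>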

    13: Configure ETags

    Entity tags (ETags) are a mechanism that web servers and browsers use to determine whether the component in the browser’s cache matches the one on the origin server. (An “entity” is another word for what I’ve been calling a “component”: images, scripts, stylesheets, etc.) ETags were added to provide a mechanism for validating entities that is more flexible than the last-modified date. An ETag is a string that uniquely identifies a specific version of a component. The only format constraints are that the string be quoted. The origin server specifies the component’s ETag using the ETag response header.

          HTTP/1.1 200 OK
          Last-Modified: Tue, 12 Dec 2006 03:03:59 GMT
          ETag: "10c24bc-4ab-457e1c1f"
          Content-Length: 12195

    Later, if the browser has to validate a component, it uses the If-None-Match header to pass the ETag back to the origin server. If the ETags match, a 304 status code is returned reducing the response by 12195 bytes for this example.

          GET /i/yahoo.gif HTTP/1.1
          Host: us.yimg.com
          If-Modified-Since: Tue, 12 Dec 2006 03:03:59 GMT
          If-None-Match: "10c24bc-4ab-457e1c1f"

          HTTP/1.1 304 Not Modified

    The problem with ETags is that they typically are constructed using attributes that make them unique to a specific server hosting a site. ETags won’t match when a browser gets the original component from one server and later tries to validate that component on a different server, a situation that is all too common on Web sites that use a cluster of servers to handle requests. By default, both Apache and IIS embed data in the ETag that dramatically reduces the odds of the validity test succeeding on web sites with multiple servers.

    The ETag format for Apache 1.3 and 2.x is inode-size-timestamp. Although a given file may reside in the same directory across multiple servers, and have the same file size, permissions, timestamp, etc., its inode is different from one server to the next.

    IIS 5.0 and 6.0 have a similar issue with ETags. The format for ETags on IIS is Filetimestamp:ChangeNumber. A ChangeNumber is a counter used to track configuration changes to IIS. It’s unlikely that the ChangeNumber is the same across all IIS servers behind a web site.

    The end result is ETags generated by Apache and IIS for the exact same component won’t match from one server to another. If the ETags don’t match, the user doesn’t receive the small, fast 304 response that ETags were designed for; instead, they’ll get a normal 200 response along with all the data for the component. If you host your web site on just one server, this isn’t a problem. But if you have multiple servers hosting your web site, and you’re using Apache or IIS with the default ETag configuration, your users are getting slower pages, your servers have a higher load, you’re consuming greater bandwidth, and proxies aren’t caching your content efficiently. Even if your components have a far future Expires header, a conditional GET request is still made whenever the user hits Reload or Refresh.

    If you’re not taking advantage of the flexible validation model that ETags provide, it’s better to just remove the ETag altogether. The Last-Modified header validates based on the component’s timestamp. And removing the ETag reduces the size of the HTTP headers in both the response and subsequent requests. This Microsoft Support article describes how to remove ETags. In Apache, this is done by simply adding the following line to your Apache configuration file:

          FileETag none

    The netcat utility (nc command) is considered a TCP/IP Swiss Army knife. It reads and writes data across network connections, using the TCP or UDP protocol. It is designed to be a reliable “back-end” tool that can be used directly or easily driven by other programs and scripts. At the same time, it is a feature-rich network debugging and exploration tool, since it can create almost any kind of connection you would need and has several interesting built-in capabilities.

    I also install the netcat package when administering a network, since its debugging and network exploration capabilities often come in handy.

    One of my favorite uses is migrating data between two servers’ hard drives using netcat over a network. It is very easy to copy a complete drive image from one server to another.

    You can also use ssh for the same purpose, but encryption adds its own overhead. This is a tried and trusted method (hat tip to karl).

    Make sure you have a backup of all important data.

    Install netcat

    It is possible that nc may not be installed by default under Red Hat / CentOS / Debian Linux.

    Install nc under Red Hat / CentOS / Fedora Linux

    Use yum command as follows:
    # yum install nc
    Output:

    Loading "installonlyn" plugin Loading "rhnplugin" plugin Setting up Install Process Setting up repositories rhel-x86_64-server-vt-5   100% |=========================| 1.2 kB    00:00 rhel-x86_64-server-5      100% |=========================| 1.2 kB    00:00 Reading repository metadata in from local files Parsing package install arguments Resolving Dependencies --> Populating transaction set with selected packages. Please wait. ---> Downloading header for nc to pack into transaction set. nc-1.84-10.fc6.x86_64.rpm 100% |=========================| 6.9 kB    00:00 ---> Package nc.x86_64 0:1.84-10.fc6 set to be updated --> Running transaction check  Dependencies Resolved  =============================================================================  Package                 Arch       Version          Repository        Size ============================================================================= Installing:  nc                      x86_64     1.84-10.fc6      rhel-x86_64-server-5   56 k  Transaction Summary ============================================================================= Install      1 Package(s) Update       0 Package(s) Remove       0 Package(s)           Total download size: 56 k Is this ok [y/N]: y Downloading Packages: (1/1): nc-1.84-10.fc6.x86 100% |=========================|  56 kB    00:00 Running Transaction Test Finished Transaction Test Transaction Test Succeeded Running Transaction   Installing: nc                           ######################### [1/1]   Installed: nc.x86_64 0:1.84-10.fc6 Complete!

    How do I use netcat to copy hard disk image?

    Our sample setup
    ------------------------
     HostA // 192.168.1.1
            sda
    ------------------------
              |
           NETWORK
              |
    ------------------------
     HostB // 192.168.1.2
            sdb
    ------------------------

    Your task is to copy HostA’s /dev/sda to HostB’s /dev/sdb using the netcat command. First, log in as root on both systems.

    Command to type on hostB (receiving end ~ write image mode)

    You need to open a port on hostB using netcat; enter:
    # netcat -p 2222 -l |bzip2 -d | dd of=/dev/sdb
    Where,

    • -p 2222 : Specifies the source port nc should use, subject to privilege restrictions and availability. Make sure port 2222 is not used by another process.
    • -l : Used to specify that nc should listen for an incoming connection rather than initiate a connection to a remote host.
    • bzip2 -d : Decompresses the incoming data stream. The image is compressed with bzip2 on the sending side (Burrows-Wheeler block sorting text compression with Huffman coding) to speed up the network transfer; -d forces decompression mode on the receiving side.
    • dd of=/dev/sdb : /dev/sdb is the destination hard disk on hostB. You can also specify a partition such as /dev/sdb1

    Command to type on hostA (send data over a network ~ read image mode)

    Now all you have to do is start copying the image. Again, log in as root and enter:
    # bzip2 -c /dev/sda | netcat hostB 2222
    OR use the IP address:
    # bzip2 -c /dev/sda | netcat 192.168.1.2 2222

    This process takes its own time.

    A note about latest netcat version 1.84-10 and above

    If you are using the latest nc / netcat version, the above syntax will generate an error: it is an error to use the -l option in conjunction with the -p, -s, or -z options. Additionally, any timeouts specified with the -w option are ignored. So use the nc command as follows.

    On hostB, enter:
    # nc -l 2222 > /dev/sdb
    On hostA, enter:
    # nc hostB 2222 < /dev/sda
    OR
    # nc 192.168.1.2 2222 < /dev/sda

    From the sending machine (hostA), connect to the listening nc process on port 2222 (hostB), feeding it the device (/dev/sda) which is to be transferred. You can also add bzip2 compression as follows.
    On hostB, enter:
    # nc -l 2222 | bzip2 -d > /dev/sdb
    On hostA, enter:
    # bzip2 -c /dev/sda | nc 192.168.1.2 2222

    How do I improve performance?

    As suggested by anonymous user:

    You should definitely use bs=16M or something like that. Otherwise, the copy will take forever. Copying a 300 GB hard drive over a 1 Gbps cross-over cable took about 1 1/2 hours or so using bs=16M. Without this option, the same thing would have taken about 7 hours.

    In short use command as follows:
    # netcat -p 2222 -l |bzip2 -d | dd of=/dev/sdb bs=16M

    My Setup:

    i) System: HP dual Xeon CPU system with 8 GB RAM (good for squid).
    ii) Eth0: IP:192.168.1.1
    iii) Eth1: IP: 192.168.2.1 (192.168.2.0/24 network with around 150 Windows XP systems)
    iv) OS: Red Hat Enterprise Linux 4.0 (the following instructions should also work with Debian and all other Linux distros)

    Eth0 is connected to the Internet and eth1 is connected to the local LAN, i.e. the system acts as a router.

     

    Server Configuration

    • Step #1 : Squid configuration so that it will act as a transparent proxy
    • Step #2 : Iptables configuration
      • a) Configure system as router
      • b) Forward all http requests to 3128 (DNAT)
    • Step #3: Run scripts and start squid service

    First, install the Squid server (use up2date squid) and configure it by adding the following directives to this file:
    # vi /etc/squid/squid.conf

    Modify or add following squid directives:
    httpd_accel_host virtual
    httpd_accel_port 80
    httpd_accel_with_proxy on
    httpd_accel_uses_host_header on
    acl lan src 192.168.1.1 192.168.2.0/24
    http_access allow localhost
    http_access allow lan

    Where,

    • httpd_accel_host virtual: run Squid as an httpd accelerator (virtual host mode)
    • httpd_accel_port 80: 80 is the port Squid should accelerate (the standard HTTP port)
    • httpd_accel_with_proxy on: Squid acts as both a local httpd accelerator and as a proxy
    • httpd_accel_uses_host_header on: use the Host header from the request, which carries the hostname from the URL
    • acl lan src 192.168.1.1 192.168.2.0/24: access control list; only allow LAN computers to use Squid
    • http_access allow localhost: allow Squid access from localhost
    • http_access allow lan: allow Squid access from the lan ACL defined above

    Here is the complete listing of squid.conf for your reference (grep will remove all comments and sed will remove all empty lines; thanks to David Klein for the quick hint):
    # grep -v "^#" /etc/squid/squid.conf | sed -e '/^$/d'

    OR, try out sed (thanks to kotnik for small sed trick)
    # cat /etc/squid/squid.conf | sed '/ *#/d; /^ *$/d'

    Output:
    hierarchy_stoplist cgi-bin ?
    acl QUERY urlpath_regex cgi-bin \?
    no_cache deny QUERY
    hosts_file /etc/hosts
    refresh_pattern ^ftp: 1440 20% 10080
    refresh_pattern ^gopher: 1440 0% 1440
    refresh_pattern . 0 20% 4320
    acl all src 0.0.0.0/0.0.0.0
    acl manager proto cache_object
    acl localhost src 127.0.0.1/255.255.255.255
    acl to_localhost dst 127.0.0.0/8
    acl purge method PURGE
    acl CONNECT method CONNECT
    cache_mem 1024 MB
    http_access allow manager localhost
    http_access deny manager
    http_access allow purge localhost
    http_access deny purge
    http_access deny !Safe_ports
    http_access deny CONNECT !SSL_ports
    acl lan src 192.168.1.1 192.168.2.0/24
    http_access allow localhost
    http_access allow lan
    http_access deny all
    http_reply_access allow all
    icp_access allow all
    visible_hostname myclient.hostname.com
    httpd_accel_host virtual
    httpd_accel_port 80
    httpd_accel_with_proxy on
    httpd_accel_uses_host_header on
    coredump_dir /var/spool/squid

    Iptables configuration

    Next, I added the following rules to forward all http requests (coming to port 80) to the Squid server port 3128:
    iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 80 -j DNAT --to 192.168.1.1:3128
    iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 3128

    Here is the complete shell script. The script first configures the Linux system as a router and then forwards all http requests to port 3128 (Download the fw.proxy shell script):
    #!/bin/sh
    # squid server IP
    SQUID_SERVER="192.168.1.1"
    # Interface connected to Internet
    INTERNET="eth0"
    # Interface connected to LAN
    LAN_IN="eth1"
    # Squid port
    SQUID_PORT="3128"
    # DO NOT MODIFY BELOW
    # Clean old firewall
    iptables -F
    iptables -X
    iptables -t nat -F
    iptables -t nat -X
    iptables -t mangle -F
    iptables -t mangle -X
    # Load IPTABLES modules for NAT and IP conntrack support
    modprobe ip_conntrack
    modprobe ip_conntrack_ftp
    # For win xp ftp client
    #modprobe ip_nat_ftp
    echo 1 > /proc/sys/net/ipv4/ip_forward
    # Setting default filter policy
    iptables -P INPUT DROP
    iptables -P OUTPUT ACCEPT
    # Unlimited access to loop back
    iptables -A INPUT -i lo -j ACCEPT
    iptables -A OUTPUT -o lo -j ACCEPT
    # Allow UDP, DNS and Passive FTP
    iptables -A INPUT -i $INTERNET -m state --state ESTABLISHED,RELATED -j ACCEPT
    # set this system as a router for Rest of LAN
    iptables --table nat --append POSTROUTING --out-interface $INTERNET -j MASQUERADE
    iptables --append FORWARD --in-interface $LAN_IN -j ACCEPT
    # unlimited access to LAN
    iptables -A INPUT -i $LAN_IN -j ACCEPT
    iptables -A OUTPUT -o $LAN_IN -j ACCEPT
    # DNAT port 80 requests coming from LAN systems to squid 3128 ($SQUID_PORT) aka transparent proxy
    iptables -t nat -A PREROUTING -i $LAN_IN -p tcp --dport 80 -j DNAT --to $SQUID_SERVER:$SQUID_PORT
    # if it is same system
    iptables -t nat -A PREROUTING -i $INTERNET -p tcp --dport 80 -j REDIRECT --to-port $SQUID_PORT
    # DROP everything and Log it
    iptables -A INPUT -j LOG
    iptables -A INPUT -j DROP

    Save shell script. Execute script so that system will act as a router and forward the ports:
    # chmod +x /etc/fw.proxy
    # /etc/fw.proxy
    # service iptables save
    # chkconfig iptables on

    Start or Restart the squid:
    # /etc/init.d/squid restart
    # chkconfig squid on

    Desktop / Client computer configuration

    Point all desktop clients to your eth1 IP address (192.168.2.1) as the Router/Gateway (use DHCP to distribute this information). You do not have to set up individual browsers to work with proxies.
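    For example, with the ISC DHCP server a minimal (hypothetical) subnet declaration distributing 192.168.2.1 as the gateway might look like the following; the address pool and DNS server are placeholders for your own values:

    subnet 192.168.2.0 netmask 255.255.255.0 {
        range 192.168.2.10 192.168.2.200;           # placeholder address pool
        option routers 192.168.2.1;                 # clients use the proxy box as their gateway
        option domain-name-servers 192.168.2.1;     # placeholder DNS server
    }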

    How do I test my squid proxy is working correctly?

    See access log file /var/log/squid/access.log:
    # tail -f /var/log/squid/access.log

    The above command will monitor all incoming requests as they are written to the /var/log/squid/access.log file. Now if somebody accesses a website through a browser, Squid will log the request.

    Problems and solutions

    (a) Windows XP FTP Client

    All desktop client FTP session requests ended with an error:
    Illegal PORT command.

    I loaded the ip_nat_ftp kernel module. Just type the following command, press Enter, and voila!
    # modprobe ip_nat_ftp

    Please note that this modprobe command is already included (commented out) in the shell script above.

    (b) Port 443 redirection

    I had blocked all connection requests in our router settings except those from our proxy server (192.168.1.1), so all ports, including 443 (https/ssl), were denied. You cannot redirect port 443. From the Debian mailing list: "Long answer: SSL is specifically designed to prevent 'man in the middle' attacks, and setting up squid in such a way would be the same as such a 'man in the middle' attack. You might be able to successfully achieve this, but not without breaking the encryption and certification that is the point behind SSL."

    Therefore, I quickly reopened port 443 (in the router firewall) for all my LAN computers and the problem was solved.

    (c) Squid Proxy authentication in a transparent mode

    You cannot use Squid authentication with a transparently intercepting proxy.

     From NixCraft