Google is clearly gathering information about us but
refuses to tell us why. It's nothing new to us; but
while they cannot control the normal SERPs, they do
control who is viewing what, and when.
We run
Google adverts in order to survive. Does this mean we
shouldn't share the information below? If you know an
alternative way to sustain our costs, please get in
touch. Please note: Google does not track you simply
because you view pages containing their adverts.
1. Google's
immortal cookie:
Google was the first search engine to use a cookie
that expires in 2038. This was at a time when federal
websites were prohibited from using persistent
cookies altogether. Now it's years later, and
immortal cookies are commonplace among search engines
; Google set the standard because no one
bothered to challenge them. This cookie places a
unique ID number on your hard disk. Anytime you land
on a Google page, you get a Google cookie if you
don't already have one. If you have one, they read
and record your unique ID number.
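The mechanics here are mundane; below is a minimal Python sketch of how any server can mint and read a persistent unique-ID cookie of the kind described. The handler and names are illustrative assumptions, not Google's actual code; only the far-future expiry date matches their cookie.

```python
# Sketch: issuing a persistent unique-ID cookie (illustrative, not Google's code).
import uuid
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrackingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        if "id" in cookies:
            user_id = cookies["id"].value      # returning visitor: read the ID
        else:
            user_id = uuid.uuid4().hex         # first visit: mint a new unique ID
        self.send_response(200)
        # A far-future expiry makes the cookie effectively "immortal".
        self.send_header("Set-Cookie",
                         f"id={user_id}; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"your id: {user_id}\n".encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), TrackingHandler).serve_forever()
```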
2. Google records everything they can:
For all searches they record the cookie ID, your
Internet IP address, the time and date, your search
terms, and your browser configuration. Increasingly,
Google is customizing results based on your IP
number. This is referred to in the industry as
"IP delivery based on geolocation."
3. Google retains all data indefinitely:
Google has no data retention policies. There is
evidence that they are able to easily access all the
user information they collect and save.
4. Google won't say why they need this data:
Inquiries to Google about their privacy policies are
ignored. When the New York Times (2002-11-28)
asked Sergey Brin whether Google ever gets
subpoenaed for this information, he had no comment.
5. Google hires spooks:
Matt Cutts, a key Google engineer, used to work for
the National Security Agency. Google wants to hire
more people with security clearances, so that they
can peddle their corporate assets to the spooks in
Washington.
6. Google's toolbar is spyware:
With the advanced features enabled, Google's free
toolbar for Internet Explorer phones home with every page you
surf, and yes, it reads your cookie too. Their
privacy policy confesses this, but that's only
because Alexa lost a class-action lawsuit when their
toolbar did the same thing, and their privacy policy
failed to explain this. Worse yet, Google's toolbar
updates to new versions quietly, and without asking.
This means that if you have the toolbar installed,
Google essentially has complete access to your hard
disk every time you connect to Google (which is many
times a day). Most software vendors, and even
Microsoft, ask if you'd like an updated version. But
not Google. Any software that updates automatically
presents a massive security risk; see http://schram.net/articles/updaterisk.html
7. Google's cache copy is illegal:
Judging from Ninth Circuit precedent on the
application of U.S. copyright laws to the Internet,
Google's cache copy appears to be illegal. The only
way a webmaster can avoid having his site cached on
Google is to put a "noarchive" meta tag in the
header of every page on his site. Surfers like the
cache, but webmasters don't. Many webmasters have
deleted questionable material from their sites, only
to discover later that the problem pages live merrily
on in Google's cache. The cache copy should be
"opt-in" for webmasters, not
"opt-out."
8. Google is not your friend:
By now Google enjoys a 75 percent monopoly on
external referrals to most websites. Webmasters
cannot avoid seeking Google's approval these days,
assuming they want to increase traffic to their site.
If they try to take advantage of some of the known
weaknesses in Google's semi-secret algorithms, they
may find themselves penalized by Google, and their
traffic disappears. There are no detailed, published
standards issued by Google, and there is no appeal
process for penalized sites. Google is completely
unaccountable. Most of the time Google doesn't even
answer email from webmasters.
9. Google is a privacy time bomb:
With 200 million searches per day, most from outside
the U.S., Google amounts to a privacy disaster
waiting to happen. Those newly-commissioned
data-mining bureaucrats in Washington can only dream
about the sort of slick efficiency that Google has
already achieved.
Source: Google Watch (google-watch.org)
April 04, 2004
The Secret Source of Google's Power
Much is being written about Gmail, Google's new
free webmail system. There's something deeper to
learn about Google from this product than the initial
reaction to the product features, however. Ignore for
a moment the observations about Google leapfrogging
their competitors with more user value and a new
feature or two. Or Google diversifying away from
search into other applications; they've been doing
that for a while. Or the privacy red herring.
No, the story is about seemingly incremental
features that are actually massively expensive for
others to match, and the platform that
Google is building which makes it cheaper and easier
for them to develop and run web-scale applications
than anyone else.
I've written before (see Table No. 1 below) about
Google's snippet service, which required that they
store the entire web in RAM. All so they could
generate a slightly better page excerpt than other
search engines.
Google has taken the last 10 years of systems
software research out of university labs, and built
their own proprietary, production-quality system.
What is this platform that Google is building? It's a
distributed computing platform that can manage
web-scale datasets on 100,000-node server clusters.
It includes a petabyte-scale, distributed,
fault-tolerant filesystem, distributed RPC code, probably network
shared memory and process migration. And a datacenter
management system which lets a handful of ops
engineers effectively run 100,000 servers. Any of
these projects could be the sole focus of a startup.
Speculation: Gmail's Architecture and Economics
Let's make some guesses about how one might build
a Gmail.
Hotmail has 60 million users. Gmail's design
should be comparable, and should scale to 100 million
users. It will only have to support a couple of
million in the first year though.
The most obvious challenge is the storage. You
can't lose people's email, and you don't want to ever
be down, so data has to be replicated. RAID is no
good; when a disk fails, a human needs to replace the
bad disk, or there is risk of data loss if more disks
fail. One imagines the old ENIAC technician running
up and down the aisles of Google's data center with a
shopping cart full of spare disk drives instead of
vacuum tubes. RAID also requires more expensive
hardware -- at least the hot swap drive trays. And
RAID doesn't handle high availability at the server
level anyway.
No. Google has 100,000 servers (New York Times).
If a server/disk dies, they leave it dead in the
rack, to be reclaimed/replaced later. Hardware
failures need to be instantly routed around by
software.
Google has built their own distributed,
fault-tolerant, petabyte filesystem, the Google
Filesystem. This is ideal for the job. Say GFS
replicates user email in three places; if a disk or a
server dies, GFS can automatically make a new copy
from one of the remaining two. Compress the email for
a 3:1 storage win, then store the user's email in
three locations, and the raw storage need is
approximately equal to the size of the user's mail.
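The arithmetic of that trade-off is easy to check; a napkin sketch in Python, with the mailbox size as an assumed figure (Gmail advertised 1 GB per user):

```python
# Napkin math: three-way replication vs 3:1 compression, as described above.
mailbox_mb  = 1000      # assumed average mailbox; Gmail advertised 1 GB
copies      = 3         # replicate each user's mail in three places
compression = 3.0       # assumed 3:1 compression on stored mail

raw_storage_mb = mailbox_mb * copies / compression
print(raw_storage_mb)   # -> 1000.0; raw need roughly equals the mail itself
```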
The Gmail servers wouldn't be top-heavy with lots
of disk. They need the CPU for indexing and page view
serving anyway. No fancy RAID card or hot-swap trays,
just 1-2 disks per 1U server.
It's straightforward to spreadsheet out the
economics of the service, taking into account average
storage per user, cost of the servers, and
monetization per user per year. Google apparently
puts the operational cost of storage at $2 per
gigabyte. My napkin math comes up with numbers in the
same ballpark. I would assume the yearly monetized
value of a webmail user to be in the $1-10 range.
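Spreadsheeting it out might look like the sketch below. The $2-per-gigabyte figure is the article's; the monetization range and the assumption that both are yearly are assumptions.

```python
# Napkin economics of a webmail user, using the figures discussed above.
storage_gb_per_user = 1.0    # Gmail's advertised 1 GB
cost_per_gb         = 2.0    # reported storage cost, $ (yearly basis assumed)
revenue_low, revenue_high = 1.0, 10.0   # assumed yearly value of a user, $

cost = storage_gb_per_user * cost_per_gb
print(f"storage cost per user: ${cost:.2f}")
print(f"margin per user: ${revenue_low - cost:.2f} to ${revenue_high - cost:.2f}")
```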
Cheap Hardware
Here's an anecdote to illustrate how far Google's
cultural approach to hardware cost differs from
the norm, and what it means as a component of their
competitive advantage.
In a previous job I specified 40 moderately-priced
servers to run a new internet search site we were
developing. The ops team overrode me; they wanted 6
more-expensive servers, since they said it would be
easier to manage 6 machines than 40.
What this does is raise the cost of a CPU second.
We had engineers that could imagine algorithms that
would give marginally better search results, but if
the algorithm was 10 times slower than the current
code, ops would have to add 10X the number of
machines to the datacenter. If you've already got $20
million invested in a modest collection of Suns,
going 10X to run some fancier code is not an option.
Google has 100,000 servers.
Any sane ops person would rather go with a fancy
$5000 server than a bare $500 motherboard plus disks
sitting exposed on a tray. But that's a 10X
difference in the cost of a CPU cycle. And this frees
up the algorithm designers to invent better stuff.
Without cheap CPU cycles, the coders won't even
consider algorithms that the Google guys are
deploying. They're just too expensive to run.
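The envelope version of that 10X argument, with the anecdote's server prices and an assumed three-year amortization:

```python
# Napkin math: cost of a CPU-second on cheap vs fancy hardware.
SECONDS_PER_YEAR = 365 * 24 * 3600
lifetime_years   = 3                      # assumed amortization period

for name, price in [("fancy $5000 server", 5000), ("bare $500 board", 500)]:
    cost_per_cpu_second = price / (lifetime_years * SECONDS_PER_YEAR)
    print(f"{name}: ${cost_per_cpu_second:.2e} per CPU-second")
# The 10X price gap is a 10X gap in the price of every cycle an algorithm
# burns -- which is what frees designers to try costlier code.
```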
Google doesn't deploy bare motherboards on exposed
trays anymore; they're on at least the fourth
iteration of their cheap hardware platform. Google
now has an institutional competence building and
maintaining servers that cost a lot less than the
servers everyone else is using. And they do it with
fewer people.
Think of the little internal factory they must
have to deploy servers, and the level of automation
needed to run that many boxes. Either network boot or
a production line to pre-install disk images. Servers
that self-configure on boot to determine their
network config and load the latest rev of the
software they'll be running. Normal datacenter ops
practices don't scale to what Google has.
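What might "self-configure on boot" look like? A hypothetical sketch; the config service, its URL, and its endpoints are all invented here, since Google's real tooling is not public:

```python
# Hypothetical sketch of a self-configuring boot step: ask a central service
# what role this box plays, then fetch the latest software rev to run.
# All names and URLs are invented; none of this is Google's real tooling.
import socket
import urllib.request

CONFIG_SERVICE = "http://config.internal.example/v1"   # hypothetical

def self_configure() -> dict:
    hostname = socket.gethostname()
    with urllib.request.urlopen(f"{CONFIG_SERVICE}/role?host={hostname}") as r:
        role = r.read().decode().strip()               # e.g. "index-server"
    with urllib.request.urlopen(f"{CONFIG_SERVICE}/latest?role={role}") as r:
        software_rev = r.read().decode().strip()       # e.g. "r1234"
    return {"host": hostname, "role": role, "rev": software_rev}

if __name__ == "__main__":
    print(self_configure())    # then: download that rev, install, start serving
```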
What are all those OS Researchers doing at
Google?
Rob Pike has gone to Google. Yes, that Rob Pike
(please see the following page in this issue of The
Handstand re. Dis, the Inferno virtual machine. JB,
editor) -- the OS researcher, the member of the
original Unix team from Bell Labs. This guy isn't
just some labs hood ornament; he writes code, lots of
it. Big chunks of whole new operating systems like Plan 9.
Table No. 1. Excerpt from an earlier Topix.net weblog
post (http://blog.topix.net/archives/000011.html): A while
ago I was chatting with my old boss Wade
about a nifty algorithm I found for
incremental search engines, which piggybacked
queued writes onto reads that the front end
requests were issuing anyway, to minimize
excess disk head seeks. I thought it was
pretty cool.
Wade smacked me on the head (gently) and
asked why I was even thinking about disk
anymore. Disk is dead; just put the whole
thing in RAM and forget about it, he said.
Orkut is wicked fast; Friendster isn't.
How do you reliably make a scalable web
service wicked fast? Easy: the whole thing
has to be in memory, and user requests must
never wait for disk.
A disk head seek is about 9ms, and the
human perceptual threshold for what seems
"instant" is around 50ms. So if you
have just one head seek per user request, you
can support at most 5 hits/second on that
server before users start to notice
latency. If you have a typical filesystem
with a little database on top, you may be up
to 3+ seeks per hit already. Forget caching;
caching helps the second user, and doesn't
work on systems with a "long tail"
of zillions of seldom-accessed queries, like
search.
The biggest RAM database of all...
An overlooked feature that made Google
really cool in the beginning was their
snippets. This is the excerpt of text that
shows a few sample sentences from each web
page matching your search. Google's snippets
show just the parts of the web page that have
your search terms in them; other search
engines had always shown the same couple of
sentences from the start of the web page, no
matter what you had searched for.
Consider the insane cost to implement this
simple feature. Google has to keep a copy of
every web page on the Internet on their
servers in order to show you the piece of the
web page where your search terms hit.
Everything is served from RAM, only booted
from disk. And they have multiple separate
search clusters at their co-locations. This
means that Google is currently storing
multiple copies of the entire web in RAM.
My napkin is hard to read with all these
zeroes on it, but that's a lot of memory.
Talk about barrier to entry.
Posted by skrenta at February 2, 2004, Topix.net weblog
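Two bits of arithmetic behind that excerpt, as a rough Python sketch. The seek and threshold figures are the excerpt's own; the page counts and sizes are assumptions:

```python
# Napkin math for the Topix excerpt: seek budgets and the web-in-RAM estimate.
seek_ms, threshold_ms = 9, 50
print(threshold_ms // seek_ms)   # -> 5: about five 9 ms seeks fit inside the
                                 # 50 ms "instant" budget -- presumably the
                                 # source of the excerpt's figure of 5

pages          = 8_000_000_000   # assumed; Google later claimed an index of 8bn+
bytes_per_page = 10_000          # assumed ~10 KB of searchable text per page
copies         = 3               # assumed copies across separate search clusters

print(pages * bytes_per_page * copies / 1e12, "TB of RAM")   # -> 240.0
```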
Look at the depth of the research background (see
Table No. 2 below for the URL) of the Google
employees in OS, networking, and distributed systems.
Compiler optimization. Thread migration. Distributed
shared memory.
Table No. 2. List of Google research papers: http://labs.google.com/papers.html#os
Below is a partial list of papers written by people
now at Google, showing the range of backgrounds of
people in Google Engineering.
I'm a sucker for cool OS research. Browsing papers
from Google employees about distributed systems,
thread migration, network shared memory, GFS, makes
me feel like a kid in Tomorrowland wondering when
we're going to Mars. Wouldn't it be great, as an
engineer, to have production versions of all this
great research?
Google engineers do!
Competitive Advantage
Google is a company that has built a single very
large, custom computer. It's running their own
cluster operating system. They make their big
computer even bigger and faster each month, while
lowering the cost of CPU cycles. It's looking more
like a general purpose platform than a cluster
optimized for a single application.
While competitors are targeting the individual
applications Google has deployed, Google is building
a massive, general purpose computing platform for
web-scale programming.
This computer is running the world's top search
engine, a social networking service, a shopping price
comparison engine, a new email service, and a local
search/yellow pages engine. What will they do next
with the world's biggest computer and most advanced
operating system?
Posted by skrenta at April 4, 2004 02:11 PM
Meantime, dear friends, a piece of Google propaganda
arrives on the desk from The Guardian:
How
unscrupulous firms are manipulating the world's leading
search engine
Bobbie
Johnson, technology correspondent
Wednesday December 21, 2005
The Guardian
It is known as the Google dance, a delicate
struggle between technicians at the world's largest
internet search engine and the spin doctors who
manipulate the worldwide web for commercial ends.
Every day one group tries to prevent the other from
abusing Google's index of more than 8bn web pages.
The stakes are high. As Britain's online shoppers
spend £150m every day in the run-up to Christmas,
the value of a high ranking on Google is potentially
worth millions to retailers, website owners and
criminals alike. Web search has become the focus for
a legion of marketers over the past decade, and their
tactics - known as search engine optimisation (SEO) -
are the basis of a multibillion pound worldwide
industry. Most are legitimate businesses, but some
so-called "black hat" SEOs use unethical
strategies to boost their clients' rankings.
Spoof site
To test the effectiveness of these tactics, the
Guardian created a spoof site and tried to force it
up Google's rankings. Over one week, a number of
tricks - some similar to those used by black-hat
firms - were used to successfully push it to the top.
The spoof site was set up to promote eco-friendly
flip-flops, a bogus product promising zero harmful
emissions. The simple page featured a disclaimer to
make the nature of the experiment clear, and a
picture of the goods. At the start of the experiment,
there were more than 11,500 results for
"eco-friendly flip-flops" on Google, and
the spoof site did not feature. Within two days of
creating the site, Google's spider - the program that
explores the web - had discovered the site and
included it in its main index, but it appeared within
the lowest 100 pages. The first attempt to boost the
ranking was a series of basic instructions intended
to manipulate Google, including overloading the page
with words that would improve the site's ranking, and
adding invisible data intended to boost it even
further. This had little effect, however, and the
spoof site remained static in Google's index.
Another trick was then used to mimic black-hat
behaviour. A second site was created which contained
a large number of links to the first. Because Google
rates the authority of a site partly by how many
times it has been linked to, this ploy can make a
site appear popular. Within hours, the effect was
apparent - the spoof site was now the top result in
our test search, trumping the other 11,500 sites
within days.
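The principle being exploited is link analysis: a page's score grows with the scores of the pages that link to it. A toy version of the published PageRank recurrence (not Google's production code) shows how a cluster of fake pages pointing at one target inflates its score:

```python
# Toy link-analysis scoring (PageRank-style), showing how fake inbound links
# inflate a target page. Not Google's production algorithm.
def rank(links, iters=50, d=0.85):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Each page donates its score equally across its outbound links.
            inbound = sum(score[q] / len(links[q])
                          for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * inbound
        score = new
    return score

# A spoof page plus a "second site" of fake pages all linking to it.
web = {"spoof": ["fake1"], "fake1": ["spoof"], "fake2": ["spoof"],
       "fake3": ["spoof"], "honest": ["other"], "other": ["honest"]}
print(rank(web))   # the spoof page clearly outranks the honest one
```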
Our experiment was small scale and limited in
scope, but in the real world the value of success is
higher than ever. Black-hat companies offer a range
of services at different prices, all aimed at
unfairly manipulating search engines. One website
found by the Guardian offered customers the chance to
download a program to spam Google for $99 (£56),
while another - posing as a legitimate SEO - charged
several thousand pounds.
Commercial power
Customers trust the results of search engines, which
rely on advertising to generate profit and have much
to gain from keeping the results clean and everything
to lose if they fail. Almost a fifth of all visits to
online shopping sites are the direct result of an
internet search, according to the data-monitoring firm Hitwise -
and more than half of those come straight from
Google. This places an astonishing level of
commercial power in the hands of one company. Google
says it takes this very seriously, but it is an
almost impossible task given the amount of
information concerned.
"There is a lot of ongoing algorithmic work
to improve the relevancy of our results in
general," said Marcus Sandberg, a Google
engineer. "Many SEOs do a great job at helping
site owners improve their content for users and
search engines alike. But some do use methods that we
consider manipulative."
Behind Google's legendary algorithm - an immense
secret mathematical formula for ranking websites -
are a series of well-known principles that can be
faked, hoaxed or otherwise influenced.
"What's grey and what's not depends on your
point of view," said Danny Sullivan, editor of
the Search Engine Watch website. He said the
relationship with SEOs is an awkward but vital one
for search engines - and that some search
manipulation is done by people "who don't know
what they're doing". "They buy tools just
like people buy guns who don't know what they're
doing," he said.
Google made headlines last week after a $1bn
(£560m) deal to buy a stake in internet service
provider AOL. Reports suggesting the company could
include more intrusive adverts on its site have
angered those who believe its clean approach is
integral to its success.
Five ways to get to the top
Status on Google is determined by a number of
factors, all of which can be faked
Key words
Good practitioners will make sure sites contain
clear information that is relevant to a user search.
Others will use misleading but popular keywords -
such as "Britney Spears" - to try to
capitalise on somebody else's fame. Some even attempt
to hide fake keywords on a page so that they can be
read by search engines but not by people
(Example, from the keyword-search help on POLITICAL
FRIENDSTER.com: "You can use AND to define words
which must be in the results, OR to define words
which may be in the result, and NOT to define words
which should not be in the result. Use * as a
wildcard for partial matches.")
Popularity
The more people that link to a site, the more
popular it is in Google's mind. By carefully choosing
who to link to and where to place those links, SEOs
can push a target website up the rankings. Some shady
operators even create a fake ecology of websites
which all point at each other
Spam
Spamming is a tactic employed by unscrupulous
SEOs, and attempts to raise profile and popularity by
leaving fake messages pointing towards the target
across thousands of other sites and weblogs. While
unpopular with surfers, it often boosts the ranking
of the site in question
Regular updates
Sites which seem new are often considered more
important, because they are more likely to contain
relevant information. Unscrupulous operators will
often steal content from other pages to create the
appearance of movement
Metadata
Each web page carries a selection of unseen
information that tells other programs what its
contents are. While most SEOs simply include correct
information about a given page, crooked operators
will use unrelated terms to try to direct unwitting
surfers
The Guardian, December 21, 2005
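That "unseen information" in the final item is chiefly the page's meta tags. A short Python sketch of reading them, roughly the first thing an indexer does before deciding how far to trust them (the sample page is invented):

```python
# Sketch: extracting a page's meta tags -- the "unseen information"
# described above -- with Python's standard HTML parser.
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

page = """<html><head>
<meta name="description" content="Eco-friendly flip-flops">
<meta name="keywords" content="flip-flops, eco, Britney Spears">
</head><body></body></html>"""

reader = MetaReader()
reader.feed(page)
print(reader.meta)   # a crooked operator fills these with unrelated terms
```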