Google is clearly gathering information about us but
refuses to tell us why. It's nothing new to us; but
while they cannot control the normal SERPs, they do
control who is viewing what, and when.
We run
Google adverts in order to survive. Does this mean we
shouldn't share the information below? If you know an
alternative way to sustain our costs, please get in
touch. Please note: Google does not track you simply
because you view pages containing their adverts.
1. Google's
immortal cookie:
Google was the first search engine to use a cookie
that expires in 2038. This was at a time when federal
websites were prohibited from using persistent
cookies altogether. Now it's years later, and
immortal cookies are commonplace among search engines
; Google set the standard because no one
bothered to challenge them. This cookie places a
unique ID number on your hard disk. Anytime you land
on a Google page, you get a Google cookie if you
don't already have one. If you have one, they read
and record your unique ID number.
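The mechanics here are mundane; below is a minimal Python sketch of how any server can mint and read a persistent unique-ID cookie of the kind described. The handler and names are illustrative assumptions, not Google's actual code; only the far-future expiry date matches their cookie.

```python
# Sketch: issuing a persistent unique-ID cookie (illustrative, not Google's code).
import uuid
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrackingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        if "id" in cookies:
            user_id = cookies["id"].value      # returning visitor: read the ID
        else:
            user_id = uuid.uuid4().hex         # first visit: mint a new unique ID
        self.send_response(200)
        # A far-future expiry makes the cookie effectively "immortal".
        self.send_header("Set-Cookie",
                         f"id={user_id}; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"your id: {user_id}\n".encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), TrackingHandler).serve_forever()
```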
2. Google records everything they can:
For all searches they record the cookie ID, your
Internet IP address, the time and date, your search
terms, and your browser configuration. Increasingly,
Google is customizing results based on your IP
number. This is referred to in the industry as
"IP delivery based on geolocation."
3. Google retains all data indefinitely:
Google has no data retention policies. There is
evidence that they are able to easily access all the
user information they collect and save.
4. Google won't say why they need this data:
Inquiries to Google about their privacy policies are
ignored. When the New York Times (2002-11-28)
asked Sergey Brin whether Google ever gets
subpoenaed for this information, he had no comment.
5. Google hires spooks:
Matt Cutts, a key Google engineer, used to work for
the National Security Agency. Google wants to hire
more people with security clearances, so that they
can peddle their corporate assets to the spooks in
Washington.
6. Google's toolbar is spyware:
With the advanced features enabled, Google's free
toolbar for Internet Explorer phones home with every page you
surf, and yes, it reads your cookie too. Their
privacy policy confesses this, but that's only
because Alexa lost a class-action lawsuit when their
toolbar did the same thing, and their privacy policy
failed to explain this. Worse yet, Google's toolbar
updates to new versions quietly, and without asking.
This means that if you have the toolbar installed,
Google essentially has complete access to your hard
disk every time you connect to Google (which is many
times a day). Most software vendors, and even
Microsoft, ask if you'd like an updated version. But
not Google. Any software that updates automatically
presents a massive security risk; see http://schram.net/articles/updaterisk.html
7. Google's cache copy is illegal:
Judging from Ninth Circuit precedent on the
application of U.S. copyright laws to the Internet,
Google's cache copy appears to be illegal. The only
way a webmaster can avoid having his site cached on
Google is to put a "noarchive" meta tag in the
header of every page on his site. Surfers like the
cache, but webmasters don't. Many webmasters have
deleted questionable material from their sites, only
to discover later that the problem pages live merrily
on in Google's cache. The cache copy should be
"opt-in" for webmasters, not
"opt-out."
8. Google is not your friend:
By now Google enjoys a 75 percent monopoly on
external referrals to most websites. Webmasters
cannot avoid seeking Google's approval these days,
assuming they want to increase traffic to their site.
If they try to take advantage of some of the known
weaknesses in Google's semi-secret algorithms, they
may find themselves penalized by Google, and their
traffic disappears. There are no detailed, published
standards issued by Google, and there is no appeal
process for penalized sites. Google is completely
unaccountable. Most of the time Google doesn't even
answer email from webmasters.
9. Google is a privacy time bomb:
With 200 million searches per day, most from outside
the U.S., Google amounts to a privacy disaster
waiting to happen. Those newly-commissioned
data-mining bureaucrats in Washington can only dream
about the sort of slick efficiency that Google has
already achieved.
Source: Google Watch (google-watch.org)
April 04, 2004
The Secret Source of Google's Power
Much is being written about Gmail, Google's new
free webmail system. There's something deeper to
learn about Google from this product than the initial
reaction to the product features, however. Ignore for
a moment the observations about Google leapfrogging
their competitors with more user value and a new
feature or two. Or Google diversifying away from
search into other applications; they've been doing
that for a while. Or the privacy red herring.
No, the story is about seemingly incremental
features that are actually massively expensive for
others to match, and the platform that
Google is building which makes it cheaper and easier
for them to develop and run web-scale applications
than anyone else.
I've written before (see Table No. 1 below) about
Google's snippet service, which required that they
store the entire web in RAM. All so they could
generate a slightly better page excerpt than other
search engines.
Google has taken the last 10 years of systems
software research out of university labs, and built
their own proprietary, production-quality system.
What is this platform that Google is building? It's a
distributed computing platform that can manage
web-scale datasets on 100,000-node server clusters.
It includes a petabyte-scale, distributed,
fault-tolerant filesystem, distributed RPC code, probably network
shared memory and process migration. And a datacenter
management system which lets a handful of ops
engineers effectively run 100,000 servers. Any of
these projects could be the sole focus of a startup.
Speculation: Gmail's Architecture and Economics
Let's make some guesses about how one might build
a Gmail.
Hotmail has 60 million users. Gmail's design
should be comparable, and should scale to 100 million
users. It will only have to support a couple of
million in the first year though.
The most obvious challenge is the storage. You
can't lose people's email, and you don't want to ever
be down, so data has to be replicated. RAID is no
good; when a disk fails, a human needs to replace the
bad disk, or there is risk of data loss if more disks
fail. One imagines the old ENIAC technician running
up and down the aisles of Google's data center with a
shopping cart full of spare disk drives instead of
vacuum tubes. RAID also requires more expensive
hardware -- at least the hot swap drive trays. And
RAID doesn't handle high availability at the server
level anyway.
No. Google has 100,000 servers (New York Times).
If a server/disk dies, they leave it dead in the
rack, to be reclaimed/replaced later. Hardware
failures need to be instantly routed around by
software.
Google has built their own distributed,
fault-tolerant, petabyte filesystem, the Google
Filesystem. This is ideal for the job. Say GFS
replicates user email in three places; if a disk or a
server dies, GFS can automatically make a new copy
from one of the remaining two. Compress the email for
a 3:1 storage win, then store the user's email in
three locations, and the raw storage need is
approximately equal to the size of the user's mail.
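The arithmetic of that trade-off is easy to check; a napkin sketch in Python, with the mailbox size as an assumed figure (Gmail advertised 1 GB per user):

```python
# Napkin math: three-way replication vs 3:1 compression, as described above.
mailbox_mb  = 1000      # assumed average mailbox; Gmail advertised 1 GB
copies      = 3         # replicate each user's mail in three places
compression = 3.0       # assumed 3:1 compression on stored mail

raw_storage_mb = mailbox_mb * copies / compression
print(raw_storage_mb)   # -> 1000.0; raw need roughly equals the mail itself
```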
The Gmail servers wouldn't be top-heavy with lots
of disk. They need the CPU for indexing and page view
serving anyway. No fancy RAID card or hot-swap trays,
just 1-2 disks per 1U server.
It's straightforward to spreadsheet out the
economics of the service, taking into account average
storage per user, cost of the servers, and
monetization per user per year. Google apparently
puts the operational cost of storage at $2 per
gigabyte. My napkin math comes up with numbers in the
same ballpark. I would assume the yearly monetized
value of a webmail user to be in the $1-10 range.
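Spreadsheeting it out might look like the sketch below. The $2-per-gigabyte figure is the article's; the monetization range and the assumption that both are yearly are assumptions.

```python
# Napkin economics of a webmail user, using the figures discussed above.
storage_gb_per_user = 1.0    # Gmail's advertised 1 GB
cost_per_gb         = 2.0    # reported storage cost, $ (yearly basis assumed)
revenue_low, revenue_high = 1.0, 10.0   # assumed yearly value of a user, $

cost = storage_gb_per_user * cost_per_gb
print(f"storage cost per user: ${cost:.2f}")
print(f"margin per user: ${revenue_low - cost:.2f} to ${revenue_high - cost:.2f}")
```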
Cheap Hardware
Here's an anecdote to illustrate how far Google's
cultural approach to hardware cost differs from
the norm, and what it means as a component of their
competitive advantage.
In a previous job I specified 40 moderately-priced
servers to run a new internet search site we were
developing. The ops team overrode me; they wanted 6
more-expensive servers, since they said it would be
easier to manage 6 machines than 40.
What this does is raise the cost of a CPU second.
We had engineers that could imagine algorithms that
would give marginally better search results, but if
the algorithm was 10 times slower than the current
code, ops would have to add 10X the number of
machines to the datacenter. If you've already got $20
million invested in a modest collection of Suns,
going 10X to run some fancier code is not an option.
Google has 100,000 servers.
Any sane ops person would rather go with a fancy
$5000 server than a bare $500 motherboard plus disks
sitting exposed on a tray. But that's a 10X
difference in the cost of a CPU cycle. And this frees
up the algorithm designers to invent better stuff.
Without cheap CPU cycles, the coders won't even
consider algorithms that the Google guys are
deploying. They're just too expensive to run.
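The envelope version of that 10X argument, with the anecdote's server prices and an assumed three-year amortization:

```python
# Napkin math: cost of a CPU-second on cheap vs fancy hardware.
SECONDS_PER_YEAR = 365 * 24 * 3600
lifetime_years   = 3                      # assumed amortization period

for name, price in [("fancy $5000 server", 5000), ("bare $500 board", 500)]:
    cost_per_cpu_second = price / (lifetime_years * SECONDS_PER_YEAR)
    print(f"{name}: ${cost_per_cpu_second:.2e} per CPU-second")
# The 10X price gap is a 10X gap in the price of every cycle an algorithm
# burns -- which is what frees designers to try costlier code.
```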
Google doesn't deploy bare motherboards on exposed
trays anymore; they're on at least the fourth
iteration of their cheap hardware platform. Google
now has an institutional competence building and
maintaining servers that cost a lot less than the
servers everyone else is using. And they do it with
fewer people.
Think of the little internal factory they must
have to deploy servers, and the level of automation
needed to run that many boxes. Either network boot or
a production line to pre-install disk images. Servers
that self-configure on boot to determine their
network config and load the latest rev of the
software they'll be running. Normal datacenter ops
practices don't scale to what Google has.
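What might "self-configure on boot" look like? A hypothetical sketch; the config service, its URL, and its endpoints are all invented here, since Google's real tooling is not public:

```python
# Hypothetical sketch of a self-configuring boot step: ask a central service
# what role this box plays, then fetch the latest software rev to run.
# All names and URLs are invented; none of this is Google's real tooling.
import socket
import urllib.request

CONFIG_SERVICE = "http://config.internal.example/v1"   # hypothetical

def self_configure() -> dict:
    hostname = socket.gethostname()
    with urllib.request.urlopen(f"{CONFIG_SERVICE}/role?host={hostname}") as r:
        role = r.read().decode().strip()               # e.g. "index-server"
    with urllib.request.urlopen(f"{CONFIG_SERVICE}/latest?role={role}") as r:
        software_rev = r.read().decode().strip()       # e.g. "r1234"
    return {"host": hostname, "role": role, "rev": software_rev}

if __name__ == "__main__":
    print(self_configure())    # then: download that rev, install, start serving
```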
What are all those OS Researchers doing at
Google?
Rob Pike has gone to Google. Yes, that Rob Pike
(please see the following page in this issue of The
Handstand re. Dis, the Inferno virtual machine. JB,
editor) -- the OS researcher, the member of the
original Unix team from Bell Labs. This guy isn't
just some labs hood ornament; he writes code, lots of
it. Big chunks of whole new operating systems like Plan 9.
Table No. 1. Excerpt from an earlier Topix.net weblog
post (http://blog.topix.net/archives/000011.html): A while
ago I was chatting with my old boss Wade
about a nifty algorithm I found for
incremental search engines, which piggybacked
queued writes onto reads that the front end
requests were issuing anyway, to minimize
excess disk head seeks. I thought it was
pretty cool.
Wade smacked me on the head (gently) and
asked why I was even thinking about disk
anymore. Disk is dead; just put the whole
thing in RAM and forget about it, he said.
Orkut is wicked fast; Friendster isn't.
How do you reliably make a scalable web
service wicked fast? Easy: the whole thing
has to be in memory, and user requests must
never wait for disk.
A disk head seek is about 9ms, and the
human perceptual threshold for what seems
"instant" is around 50ms. So if you
have just one head seek per user request, you
can support at most 5 hits/second on that
server before users start to notice
latency. If you have a typical filesystem
with a little database on top, you may be up
to 3+ seeks per hit already. Forget caching;
caching helps the second user, and doesn't
work on systems with a "long tail"
of zillions of seldom-accessed queries, like
search.
The biggest RAM database of all...
An overlooked feature that made Google
really cool in the beginning was their
snippets. This is the excerpt of text that
shows a few sample sentences from each web
page matching your search. Google's snippets
show just the parts of the web page that have
your search terms in them; other search
engines had always shown the same couple of
sentences from the start of the web page, no
matter what you had searched for.
Consider the insane cost to implement this
simple feature. Google has to keep a copy of
every web page on the Internet on their
servers in order to show you the piece of the
web page where your search terms hit.
Everything is served from RAM, only booted
from disk. And they have multiple separate
search clusters at their co-locations. This
means that Google is currently storing
multiple copies of the entire web in RAM.
My napkin is hard to read with all these
zeroes on it, but that's a lot of memory.
Talk about barrier to entry.
Posted by skrenta at February 2, 2004, Topix.net weblog
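Two bits of arithmetic behind that excerpt, as a rough Python sketch. The seek and threshold figures are the excerpt's own; the page counts and sizes are assumptions:

```python
# Napkin math for the Topix excerpt: seek budgets and the web-in-RAM estimate.
seek_ms, threshold_ms = 9, 50
print(threshold_ms // seek_ms)   # -> 5: about five 9 ms seeks fit inside the
                                 # 50 ms "instant" budget -- presumably the
                                 # source of the excerpt's figure of 5

pages          = 8_000_000_000   # assumed; Google later claimed an index of 8bn+
bytes_per_page = 10_000          # assumed ~10 KB of searchable text per page
copies         = 3               # assumed copies across separate search clusters

print(pages * bytes_per_page * copies / 1e12, "TB of RAM")   # -> 240.0
```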
Look at the depth of the research background (see
Table No. 2 below for the URL) of the Google
employees in OS, networking, and distributed systems.
Compiler optimization. Thread migration. Distributed
shared memory.
Table No. 2. List of Google research papers: http://labs.google.com/papers.html#os
Below is a partial list of papers written by people
now at Google, showing the range of backgrounds of
people in Google Engineering.
I'm a sucker for cool OS research. Browsing papers
from Google employees about distributed systems,
thread migration, network shared memory, GFS, makes
me feel like a kid in Tomorrowland wondering when
we're going to Mars. Wouldn't it be great, as an
engineer, to have production versions of all this
great research?
Google engineers do!
Competitive Advantage
Google is a company that has built a single very
large, custom computer. It's running their own
cluster operating system. They make their big
computer even bigger and faster each month, while
lowering the cost of CPU cycles. It's looking more
like a general purpose platform than a cluster
optimized for a single application.
While competitors are targeting the individual
applications Google has deployed, Google is building
a massive, general purpose computing platform for
web-scale programming.
This computer is running the world's top search
engine, a social networking service, a shopping price
comparison engine, a new email service, and a local
search/yellow pages engine. What will they do next
with the world's biggest computer and most advanced
operating system?
Posted by skrenta at April 4, 2004 02:11 PM
Meantime, dear friends, a piece of Google propaganda
arrives on the desk from The Guardian:
How
unscrupulous firms are manipulating the world's leading
search engine
Bobbie
Johnson, technology correspondent
Wednesday December 21, 2005
The Guardian
It is known as the Google dance, a delicate
struggle between technicians at the world's largest
internet search engine and the spin doctors who
manipulate the worldwide web for commercial ends.
Every day one group tries to prevent the other from
abusing Google's index of more than 8bn web pages.
The stakes are high. As Britain's online shoppers
spend £150m every day in the run-up to Christmas,
the value of a high ranking on Google is potentially
worth millions to retailers, website owners and
criminals alike. Web search has become the focus for
a legion of marketers over the past decade, and their
tactics - known as search engine optimisation (SEO) -
are the basis of a multibillion pound worldwide
industry. Most are legitimate businesses, but some
so-called "black hat" SEOs use unethical
strategies to boost their clients' rankings.
Spoof site
To test the effectiveness of these tactics, the
Guardian created a spoof site and tried to force it
up Google's rankings. Over one week, a number of
tricks - some similar to those used by black-hat
firms - were used to successfully push it to the top.
The spoof site was set up to promote eco-friendly
flip-flops, a bogus product promising zero harmful
emissions. The simple page featured a disclaimer to
make the nature of the experiment clear, and a
picture of the goods. At the start of the experiment,
there were more than 11,500 results for
"eco-friendly flip-flops" on Google, and
the spoof site did not feature. Within two days of
creating the site, Google's spider - the program that
explores the web - had discovered the site and
included it in its main index, but it appeared within
the lowest 100 pages. The first attempt to boost the
ranking was a series of basic instructions intended
to manipulate Google, including overloading the page
with words that would improve the site's ranking, and
adding invisible data intended to boost it even
further. This had little effect, however, and the
spoof site remained static in Google's index.
Another trick was then used to mimic black-hat
behaviour. A second site was created which contained
a large number of links to the first. Because Google
rates the authority of a site partly by how many
times it has been linked to, this ploy can make a
site appear popular. Within hours, the effect was
apparent - the spoof site was now the top result in
our test search, trumping the other 11,500 sites
within days.
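The principle being exploited is link analysis: a page's score grows with the scores of the pages that link to it. A toy version of the published PageRank recurrence (not Google's production code) shows how a cluster of fake pages pointing at one target inflates its score:

```python
# Toy link-analysis scoring (PageRank-style), showing how fake inbound links
# inflate a target page. Not Google's production algorithm.
def rank(links, iters=50, d=0.85):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Each page donates its score equally across its outbound links.
            inbound = sum(score[q] / len(links[q])
                          for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * inbound
        score = new
    return score

# A spoof page plus a "second site" of fake pages all linking to it.
web = {"spoof": ["fake1"], "fake1": ["spoof"], "fake2": ["spoof"],
       "fake3": ["spoof"], "honest": ["other"], "other": ["honest"]}
print(rank(web))   # the spoof page clearly outranks the honest one
```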
Our experiment was small scale and limited in
scope, but in the real world the value of success is
higher than ever. Black-hat companies offer a range
of services at different prices, all aimed at
unfairly manipulating search engines. One website
found by the Guardian offered customers the chance to
download a program to spam Google for $99 (£56),
while another - posing as a legitimate SEO - charged
several thousand pounds.
Commercial power
Customers trust the results of search engines, which
rely on advertising to generate profit and have much
to gain from keeping the results clean and everything
to lose if they fail. Almost a fifth of all visits to
online shopping sites are the direct result of an
internet search, according to the data-monitoring firm Hitwise -
and more than half of those come straight from
Google. This places an astonishing level of
commercial power in the hands of one company. Google
says it takes this very seriously, but it is an
almost impossible task given the amount of
information concerned.
"There is a lot of ongoing algorithmic work
to improve the relevancy of our results in
general," said Marcus Sandberg, a Google
engineer. "Many SEOs do a great job at helping
site owners improve their content for users and
search engines alike. But some do use methods that we
consider manipulative."
Behind Google's legendary algorithm - an immense
secret mathematical formula for ranking websites -
are a series of well-known principles that can be
faked, hoaxed or otherwise influenced.
"What's grey and what's not depends on your
point of view," said Danny Sullivan, editor of
the Search Engine Watch website. He said the
relationship with SEOs is an awkward but vital one
for search engines - and that some search
manipulation is done by people "who don't know
what they're doing". "They buy tools just
like people buy guns who don't know what they're
doing," he said.
Google made headlines last week after a $1bn
(£560m) deal to buy a stake in internet service
provider AOL. Reports suggesting the company could
include more intrusive adverts on its site have
angered those who believe its clean approach is
integral to its success.
Five ways to get to the top
Status on Google is determined by a number of
factors, all of which can be faked
Key words
Good practitioners will make sure sites contain
clear information that is relevant to a user search.
Others will use misleading but popular keywords -
such as "Britney Spears" - to try to
capitalise on somebody else's fame. Some even attempt
to hide fake keywords on a page so that they can be
read by search engines but not by people
(Example, from the keyword-search help on POLITICAL
FRIENDSTER.com: "You can use AND to define words
which must be in the results, OR to define words
which may be in the result, and NOT to define words
which should not be in the result. Use * as a
wildcard for partial matches.")
Popularity
The more people that link to a site, the more
popular it is in Google's mind. By carefully choosing
who to link to and where to place those links, SEOs
can push a target website up the rankings. Some shady
operators even create a fake ecology of websites
which all point at each other
Spam
Spamming is a tactic employed by unscrupulous
SEOs, and attempts to raise profile and popularity by
leaving fake messages pointing towards the target
across thousands of other sites and weblogs. While
unpopular with surfers, it often boosts the ranking
of the site in question
Regular updates
Sites which seem new are often considered more
important, because they are more likely to contain
relevant information. Unscrupulous operators will
often steal content from other pages to create the
appearance of movement
Metadata
Each web page carries a selection of unseen
information that tells other programs what its
contents are. While most SEOs simply include correct
information about a given page, crooked operators
will use unrelated terms to try to direct unwitting
surfers
The Guardian, December 21, 2005
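That "unseen information" in the final item is chiefly the page's meta tags. A short Python sketch of reading them, roughly the first thing an indexer does before deciding how far to trust them (the sample page is invented):

```python
# Sketch: extracting a page's meta tags -- the "unseen information"
# described above -- with Python's standard HTML parser.
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

page = """<html><head>
<meta name="description" content="Eco-friendly flip-flops">
<meta name="keywords" content="flip-flops, eco, Britney Spears">
</head><body></body></html>"""

reader = MetaReader()
reader.feed(page)
print(reader.meta)   # a crooked operator fills these with unrelated terms
```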