The digital age generates reams of raw data. Much of that data is interesting or important, but since there’s a lot of it out there it’s often hard to find and analyze. This is where journalists can help. Journalists are experts at delving into complex issues and writing stories that make them accessible—essential skills for dealing with the data deluge of the digital age. In order to support and encourage innovative data journalism, we’re sponsoring a series of prizes all across Europe.
Let’s start in the Nordics, where we recently partnered with Danish newspaper Dagbladet Information and Southern Denmark University’s Center for Journalism to sponsor the Nordic News Hacker 2012 contest. Contestants were asked to create and submit a piece of data journalism—anything from a data mash-up to a new mobile app.
This year’s winner is Anders Pedersen. Anders’s project, Doctors for Sale, inspired by ProPublica’s Dollars for Docs investigation in the United States, used raw data to uncover doctors who receive money from the pharmaceutical industry. He wins a $20,000 scholarship to work with the Guardian Data Blog in London for one month to further his investigative skills.
Several thousand kilometers south of Denmark at the International Journalism Festival, the Global Editors Network announced the 60 shortlisted projects for the Google-sponsored Data Journalism Awards. Some 320 projects were submitted from a diverse group of applicants including major media groups, regional newspapers, press associations, and entrepreneurial journalists from more than 60 countries. Six winners will be announced during the News World Summit on May 31, 2012 in Paris.
In Vienna, the International Press Institute recently announced the winners of its News Innovation contest, sponsored by Google. Fourteen projects were selected, including digital training in the Middle East, corruption chasing in the Balkans, and citizen photojournalism in the UK. All use digital data and new technologies to tell stories or reach new audiences. The winners received a total of more than $1.7 million.
Congratulations to all the journalists and publications who are embracing the digital world!
Posted by Peter Barron, Director, External Relations, Europe, Middle East and Africa
In 2001, Google started providing a service that could translate eight languages to and from English. It used what was then state-of-the-art commercial machine translation (MT), but the translation quality wasn’t very good, and it didn’t improve much in those first few years. In 2003, a few Google engineers decided to ramp up the translation quality and tackle more languages. That's when I got involved. I was working as a researcher on DARPA projects looking at a new approach to machine translation—learning from data—which held the promise of much better translation quality. I got a phone call from those Googlers who convinced me (I was skeptical!) that this data-driven approach might work at Google scale.
I joined Google, and we started to retool our translation system toward competing in the NIST Machine Translation Evaluation, a “bake-off” among research institutions and companies to build better machine translation. Google’s massive computing infrastructure and ability to crunch vast sets of web data gave us strong results. This was a major turning point: it underscored how effective the data-driven approach could be.
But at that time our system was too slow to run as a practical service—it took us 40 hours and 1,000 machines to translate 1,000 sentences. So we focused on speed, and a year later our system could translate a sentence in under a second, and with better quality. In early 2006, we rolled out our first languages: Chinese, then Arabic.
We announced our statistical MT approach on April 28, 2006, and in the six years since then we’ve focused primarily on core translation quality and language coverage. We can now translate among any of 64 different languages, including many with a small web presence, such as Bengali, Basque, Swahili, Yiddish, even Esperanto.
Today we have more than 200 million monthly active users on translate.google.com (and even more in other places where you can use Translate, such as Chrome, mobile apps, YouTube, etc.). People also seem eager to access Google Translate on the go (the language barrier is never more acute than when you’re traveling)—we’ve seen our mobile traffic more than quadruple year over year. And our users are truly global: more than 92 percent of our traffic comes from outside the United States.
In a given day we translate roughly as much text as you’d find in 1 million books. To put it another way: what all the professional human translators in the world produce in a year, our system translates in roughly a single day. By this estimate, most of the translation on the planet is now done by Google Translate. (We can’t speak for the galaxy; Douglas Adams’s “Babel fish” probably has us beat there.) Of course, for nuanced or mission-critical translations, nothing beats a human translator—and we believe that as machine translation encourages people to speak their own languages more and carry on more global conversations, translation experts will be more crucial than ever.
We imagine a future where anyone in the world can consume and share any information, no matter what language it’s in, and no matter where it pops up. We already provide translation for webpages on the fly as you browse in Chrome, text in mobile photos, YouTube video captions, and speech-to-speech “conversation mode” on smartphones. We want to knock down the language barrier wherever it trips people up, and we can’t wait to see what the next six years will bring.
Posted by Franz Och, Distinguished Research Scientist, Google Translate
What do a former violent jihadist from Indonesia, an ex-neo-Nazi from Sweden and a Canadian who was held hostage for 15 months in Somalia have in common? In addition to their past experiences with radicalization, they are all also members of Against Violent Extremism (AVE), a new online network that is launching today from the Institute for Strategic Dialogue (ISD) with support from our think/do tank Google Ideas, the Gen Next Foundation and other partners. This is the first time that former extremists, survivors, nonprofits and private sector leaders from around the world are combining forces and using online tools to tackle the problem of violent extremism.
The idea for this network first came about last summer when we hosted the Summit Against Violent Extremism in Dublin. We wanted to initiate a global conversation on how best to prevent youth from becoming radicalized. In some ways, it was a bit of an experiment to see if we could get so-called “formers”—those who had renounced their previous lives of violent extremism—and survivors of such violence to come together in one place.
To reframe the issue of counter-radicalization, we decided to spotlight formers as positive role models for youth. We also knew that there has traditionally been an over-reliance on governments to tackle these problems, so we wanted to see what diverse groups outside the public sector could offer. Finally, we needed to go beyond the in-person, physical conversations we had at the summit into the realm of the virtual, using the Internet to ensure sustained discussion and debate.
Until now, there has never been a one-stop shop for people who want to help fight these challenges: a place to connect with others across sectors and disciplines to get expertise and resources. The AVE web platform contains tools for those wanting to act on this issue, forums for dialogue, and information about the projects that the network has spawned. The site, which is in beta, will be managed by ISD, a London-based think tank that has long worked on issues surrounding radicalization. AVE’s seed members are a global network of formers, survivors of violent extremism, NGOs, academics, think tanks and private sector execs, all with a shared goal of preventing youth from becoming radicalized. You can hear from some of the participants in this video.
Working with the formers over the past several months has turned out to be an exploration of a kind of illicit network: violent extremism. But it’s touched on other types of illicit networks too—such as drug smuggling, human trafficking and the underground arms trade. With the launch of the AVE network, we plan to turn much of our attention over the next several months to these other areas. This afternoon as part of the Tribeca Film Festival, I will be moderating a panel discussion, Illicit Networks: Portrayal Through Film, talking to a former child soldier, a farm laborer who’s gone undercover to investigate modern-day slavery, a survivor of trafficking and abuse, and a former arms broker. We’ll be watching various movie clips and discussing what people learn from Hollywood when it comes to the mysterious and misunderstood world of illicit networks.
This will be an early look at what’s to come this summer when we will again partner with Tribeca Enterprises and the Council on Foreign Relations (as we did last year in Dublin) to convene the Illicit Networks: Forces in Opposition (INFO) Summit. We plan to bring together a diverse cross-section of activists, survivors, policymakers and engineers to come up with creative ideas about how technology can disrupt some of the world’s most dangerous illicit networks. We want to look not only at how technology has been part of the problem, but also at how it can be part of the solution by empowering those who are adversely affected by illicit networks. We look forward to sharing with you what we learn.
Posted by Jared Cohen, Director, Google Ideas
This is the first in a series of posts that will provide greater transparency about how we make our ads safer by detecting and removing scam ads. -Ed.
A few weeks ago, we posted here about our efforts in fighting bad ads, and we shared a video with the basics of how we do it. Today I wanted to delve a little deeper and give some insight into the systems we use to help prevent bad ads from showing. Our ads policies are designed with safety and trust in mind—we don’t allow ads for malicious downloads, counterfeit goods, or ads with unclear billing practices, to name a few examples. In order to help prevent these kinds of ads from showing, we use a combination of automated systems and human input to review the billions of ads submitted to Google each year. I’m one of many engineers whose job is to help make sure that Google doesn’t show bad ads to users.
We’ve designed our approach around a three-pronged strategy, with each prong focused on a different dimension of the problem: ads, sites, and advertiser accounts. These systems are complementary, sharing signals with one another so that we can comprehensively attack bad ads.
For example, in the case of a site that is selling counterfeit goods, this three-pronged approach aims to look for patterns that would flag such a site and help prevent ads from showing. Ad review notices patterns in the ads and keywords selected by the advertiser. Site review analyzes the entire site to determine if it is selling counterfeit goods. Account review aims to determine if a new advertiser is truly new, or is simply a repeat offender trying to abuse Google’s advertising system. Here’s more detail on how we review each of these three components.
Ad Review
An ad is the snippet of information presented to a user, along with a link to a specific webpage, or landing page. The ads review system inspects individual ads and landing pages, and is probably the system most familiar to advertisers. When an advertiser submits an ad, our system immediately performs a preliminary examination. If there’s nothing in the ad that flags a need for further review, we tell the advertiser the ad is “Eligible” and show the ad only on google.com to users who have SafeSearch turned off. If the ad is flagged for further review, in most cases we refer to the ad as “Under Review” and don’t show the ad at all.
From there, the ad enters our automated pipeline, where we employ machine learning models, a rules engine and landing page analysis to perform a more extensive examination. If our automated system determines an outcome with a high degree of confidence, we will either approve the ad to run on Google and all of our partners (“Approved”), approve the ad to show for appropriate users in specific locations (“Approved - Limited”) or reject the ad (“Disapproved”). If our automated system isn’t able to determine the outcome, we send the ad to a real person to make a final decision.
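As a rough illustration only, the status flow described above could be sketched as follows. The status labels in quotes come from the description; everything else (the function names, the `verdict` strings, the `HUMAN_REVIEW` label and the 0.9 confidence threshold) is a hypothetical stand-in chosen for this sketch, not a detail of the real system.

```python
from enum import Enum


class AdStatus(Enum):
    ELIGIBLE = "Eligible"
    UNDER_REVIEW = "Under Review"
    APPROVED = "Approved"
    APPROVED_LIMITED = "Approved - Limited"
    DISAPPROVED = "Disapproved"
    # Hypothetical label for "sent to a real person"; not a status named in the post.
    HUMAN_REVIEW = "Human review"


def preliminary_status(flagged: bool) -> AdStatus:
    """Preliminary examination: flagged ads are held, the rest are Eligible."""
    return AdStatus.UNDER_REVIEW if flagged else AdStatus.ELIGIBLE


def automated_outcome(verdict: str, confidence: float,
                      threshold: float = 0.9) -> AdStatus:
    """Map an automated verdict to a status.

    `verdict` is one of "approve", "limit" or "reject" (assumed names).
    Below the assumed confidence threshold the ad goes to a person.
    """
    if confidence < threshold:
        return AdStatus.HUMAN_REVIEW
    outcomes = {
        "approve": AdStatus.APPROVED,
        "limit": AdStatus.APPROVED_LIMITED,
        "reject": AdStatus.DISAPPROVED,
    }
    return outcomes[verdict]
```

For example, `automated_outcome("reject", 0.95)` yields the “Disapproved” status, while the same verdict at a confidence of 0.5 is routed to a human reviewer.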
Site Review
A site, often known as a domain, has many different pages, each of which could be pointed to by different ads. Our site review system identifies policy issues which apply to the whole site. It aggregates sites across all ads from all advertisers and regularly crawls them, building a repository of information that’s constantly improving as new scams and new sites are examined. We store the content of advertised sites and use both machine learning models and a rules engine to analyze the sites. The magic of the site review system is that it understands the structure of language on webpages in order to classify the content of sites. Site review will determine whether or not an entire site should be disabled, which would prevent any ads leading to that site from showing from any account. When the automated system isn’t able to determine the outcome with a high degree of confidence, we send it to a real person to make a decision. When a site is disabled, we tell the advertiser that it’s in violation of “Site Policy.”
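To make the idea of aggregating sites across ads a little more concrete, here is a minimal sketch. It assumes that a "site" can be approximated by the hostname of each ad's landing page URL, which is a simplification made for this example; the function names and the sample data are hypothetical too.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_ads_by_site(ads):
    """Group ad IDs by the hostname of their landing page.

    `ads` is an iterable of (ad_id, landing_page_url) pairs, possibly
    coming from many different advertiser accounts.
    """
    by_site = defaultdict(list)
    for ad_id, url in ads:
        # Assumption: the hostname stands in for the "site".
        host = urlsplit(url).hostname or ""
        by_site[host].append(ad_id)
    return dict(by_site)


def blocked_ads(ads, disabled_sites):
    """Return the IDs of all ads that lead to a disabled site."""
    by_site = group_ads_by_site(ads)
    return sorted(
        ad_id
        for site in disabled_sites
        for ad_id in by_site.get(site, [])
    )


# Hypothetical sample data: two ads point at the same site.
sample_ads = [
    ("ad-1", "http://shop.example/watches"),
    ("ad-2", "http://shop.example/bags"),
    ("ad-3", "http://other.example/"),
]
```

With this sample data, disabling `shop.example` blocks both `ad-1` and `ad-2`, regardless of which account submitted them.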
Account Review
An account is one particular advertiser’s collection of ads, plus the advertiser’s selections for targeting and bidding on those ads. An account may have many ads which may point to several different sites, for example. The account review system constantly evaluates individual advertiser accounts to determine if the whole account should be inspected and shut down for policy violations. This system “listens” to a variety of signals, such as ads and keywords submitted by the advertiser, budget changes, the advertiser’s address and phone number, the advertiser’s IP address, disabled sites connected to this account, and disapproved ads. The system constantly re-evaluates all accounts, incorporating new data. For example, if an advertiser logs in from a new IP address, the account is re-evaluated to determine if that new signal suggests we should take a closer look at the content of the advertiser’s account. If the account review system determines that there is something suspect about a particular account with a high degree of confidence, it automatically suspends the account. If the system isn’t sure, it stops the account from showing any ads at all and asks a real person to decide if the account should be suspended.
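The three possible outcomes described above (no action, pause and ask a person, or automatic suspension) can be illustrated with a toy scoring function. This is only a sketch under stated assumptions: the three signals used, their point weights and both thresholds are invented for the example and say nothing about how the real system weighs its signals.

```python
from dataclasses import dataclass


@dataclass
class AccountSignals:
    """A small, hypothetical subset of the signals mentioned above."""
    disapproved_ads: int = 0
    disabled_sites: int = 0
    new_ip_address: bool = False


# Assumed thresholds on an invented 0-13 point scale.
SUSPEND_AT = 9
REVIEW_AT = 4


def risk_points(signals: AccountSignals) -> int:
    """Combine signals into a single score using made-up weights."""
    points = 2 * min(signals.disapproved_ads, 3)
    points += 3 * min(signals.disabled_sites, 2)
    points += 1 if signals.new_ip_address else 0
    return points


def evaluate_account(signals: AccountSignals) -> str:
    """Return the action for an account given its current signals."""
    points = risk_points(signals)
    if points >= SUSPEND_AT:
        return "suspend"  # high confidence: suspend automatically
    if points >= REVIEW_AT:
        return "pause_and_human_review"  # unsure: stop ads, ask a person
    return "no_action"
```

Because the function takes the full set of signals each time, re-evaluating an account when something changes (a login from a new IP address, say) is just another call to `evaluate_account` with the updated values.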
Even with all these systems and people working to stop bad ads, there still can be times when an ad slips through that we don’t want. There are many malicious players who are very persistent—they seek to abuse Google’s advertising system in order to take advantage of our users. When we shut down a thousand accounts, they create two thousand more using different patterns. It’s a never-ending game of cat and mouse.
We’ve put a great deal of effort and expense into building these systems because Google’s long-term success is based on the trust of people who use our products. I’ve focused my time and energy in this area for many years. I find it inspiring to fight the good fight, to focus on the user, and do everything we can to help prevent bad ads from running. I’ll continue to post here from time to time with additional thoughts and greater information about how we make ads safer by detecting and removing scam ads.
Posted by David W. Baker, Director of Engineering, Advertising