Was Indy written by Indians?

← →
Maacheba (2009-02-09 11:49) [0]

I'm looking for a function that converts a URL-encoded string to an ANSI string (string). There are plenty of examples on the internet, some even use assembler, but I can't make sense of the code, and I get the feeling it only works in a few hand-picked cases.
I need functions that encode in both directions, the way browsers do for GET requests.

I looked at Indy and found this:

class function TIdURI.URLDecode(ASrc: string): string;
var
  i: integer;
  ESC: string[2];
  CharCode: integer;
begin
  Result := '';    {Do not Localize}
  ASrc := StringReplace(ASrc, '+', ' ', [rfReplaceAll]);  {do not localize}
  i := 1;
  while i <= Length(ASrc) do begin
    if ASrc[i] <> '%' then begin  {do not localize}
      Result := Result + ASrc[i]
    end else begin
      Inc(i); // skip the % char
      ESC := Copy(ASrc, i, 2); // Copy the escape code
      Inc(i, 1); // Then skip it.
      try
        CharCode := StrToInt('$' + ESC);  {do not localize}
        if (CharCode > 0) and (CharCode < 256) then begin
          Result := Result + Char(CharCode);
        end;
      except end;
    end;
    Inc(i);
  end;
end;

First, national alphabets are not taken into account at all, as far as I can tell. That is, it will not correctly handle, say, the Russian letter "а" (as %D0%B0).

Second, I particularly liked this:

ESC := Copy(ASrc, i, 2); // Copy the escape code
...
CharCode := StrToInt('$' + ESC);  {do not localize}
if (CharCode > 0) and (CharCode < 256) then begin
  Result := Result + Char(CharCode);

Why on earth check for "< 256" when a two-character hex value cannot exceed 255 even in theory?!

So, does anyone have reliable, tested, SANE functions for converting a string to URL format and back?

← →
Maacheba (2009-02-09 11:51) [1]

And here is another nagging doubt. As far as I can see, the space character is replaced by "+". Are there any other non-standard special substitutions like that?

And I clearly remember that somewhere a space is replaced by "%20". Where is that? I'm confused... Or are there two URL encoding formats, a Unicode one and an ANSI one?
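Both conventions exist, and the difference is not Unicode versus ANSI: generic percent-encoding writes a space as %20, while the application/x-www-form-urlencoded format used for HTML form submissions writes it as "+". A minimal sketch using Python's standard urllib.parse (Python is used here only for illustration; it is not what the thread's Delphi code does):

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "delphi vcl"

# Generic percent-encoding: a space becomes %20.
path_style = quote(text)
# Form encoding (application/x-www-form-urlencoded): a space becomes "+".
form_style = quote_plus(text)

print(path_style)   # delphi%20vcl
print(form_style)   # delphi+vcl

# Each decoder has to be paired with its encoder.
print(unquote(path_style))        # delphi vcl
print(unquote_plus(form_style))   # delphi vcl
print(unquote(form_style))        # delphi+vcl (plain unquote leaves "+" alone)
```

So a decoder has to know which of the two conventions produced the string before it decides what "+" means.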

← →
Ega23 © (2009-02-09 11:55) [2]

JavaScript
document.write(escape("Visit W3Schools!") + "<br />");
document.write(escape("?!=()#%&"));

Result
Visit%20W3Schools%21
%3F%21%3D%28%29%23%25%26

← →
Rouse_ © (2009-02-09 12:00) [3]

InternetCanonicalizeUrl / UrlCanonicalize ?

← →
Maacheba (2009-02-09 12:35) [4]

There, Wikipedia has a decent description: http://ru.wikipedia.org/wiki/Url

One thing I don't get is this:

"It should be noted that MediaWiki avoids encoding the space as %20; instead it is replaced everywhere by the underscore character «_». Many search engines replace the space with the «+» character."

So a space is %20 after all? Then why do search engines replace it with "+" (apparently so that it looks like a join, as in "delphi+vcl" rather than "delphi%20vcl"), and, most importantly, how?! With Yandex I can still understand it, their form is overloaded:

<form action="http://yandex.ru/yandsearch" onsubmit="var clid=location.href.match(/clid=(\d+)/);location.href=this.action+'?rpt=rad&text='+encodeURIComponent(this.text.value)+(clid?'&'+clid[0]:'');return false">

but Google has nothing of the sort:

<form action="/search" name=f> ... <input autocomplete="off" maxlength=2048 name=q size=55 title="Поиск в Google" value="">

Where is the conversion here? And yet Google, too, shows "+" instead of %20 for a space.

← →
clickmaker © (2009-02-09 12:38) [5]

.NET's UrlEncode also replaces a space with +.
I ran into this quite recently: the browser did not understand + in the URL.

← →
brother © (2009-02-09 12:46) [6]

> So a space is %20 after all?

100% so ;)

> Then why do search engines replace it with "+"

IMHO their parser interprets it as an AND connective, and a plain space as just a character...

← →
KSergey © (2009-02-09 12:49) [7]

> clickmaker © (09.02.09 12:38) [5]
> .NET's UrlEncode also replaces a space with +.
> I ran into this quite recently: the browser did not understand + in the URL.

The point is probably this: a space is not allowed within the host name (domain name) anyway, and after that it doesn't matter how you encode it, as long as you use matching pairs of functions.
Though what to do with the + character itself is not quite clear to me :)

PS
I tried it: Google encodes it as %2B
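The %2B observation fits the form-encoding rules: because "+" stands for a space, a literal plus sign has to be percent-encoded. A small sketch with Python's urllib.parse (illustrative only; the sample query string is made up):

```python
from urllib.parse import quote_plus, unquote_plus

query = "C++ builder"  # hypothetical search text

encoded = quote_plus(query)
print(encoded)                 # C%2B%2B+builder
print(unquote_plus(encoded))   # C++ builder

# An unescaped "+" is read back as a space by a form decoder.
print(repr(unquote_plus("C++")))   # 'C  '
```

This is why matching encoder/decoder pairs matter: the same "+" means a space to one decoder and a plus sign to another.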

← →
Maacheba (2009-02-09 12:50) [8]

Hmm, I'm completely lost... And meanwhile on this very site the Russian letter "а" is encoded as %E0, for example:

http://www.delphimaster.ru/cgi-bin/forum.pl?n=3&search=%E0

← →
brother © (2009-02-09 12:52) [9]

Don't forget that the same thing can be encoded in different ways ;)

← →
Maacheba (2009-02-09 12:52) [10]

> IMHO their parser interprets it as an AND connective, and a plain space
> as just a character...

What does "THEIR" parser have to do with it, if the browser itself sends GET parameters in this format?

← →
brother © (2009-02-09 12:53) [11]

> What does "THEIR" parser have to do with it, if the browser itself
> sends GET parameters in this format?

sends them to whom?)))

← →
Maacheba (2009-02-09 13:02) [12]

Ah, it also depends on the page encoding! If it is ANSI (such as win-1251), one format is sent; if the page is UTF-8, another.

Folks, is there really nobody here who understands all this perfectly? ))
I'd like an algorithm for decoding a URL, preferably without knowing the encoding )
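The charset dependence is easy to reproduce: percent-encoding works on bytes, so the same letter yields different escapes depending on which charset turned it into bytes. A minimal Python sketch (illustrative; "cp1251" is Python's name for win-1251):

```python
from urllib.parse import quote, unquote

letter = "а"  # Cyrillic small letter a (U+0430)

as_cp1251 = quote(letter, encoding="cp1251")
as_utf8 = quote(letter, encoding="utf-8")

print(as_cp1251)  # %E0
print(as_utf8)    # %D0%B0

# Decoding needs the same charset that was used for encoding.
print(unquote(as_cp1251, encoding="cp1251"))  # а
print(unquote(as_utf8, encoding="utf-8"))     # а
```

The escapes themselves carry no charset label, which is why a decoder that is not told the encoding can only guess it.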

← →
Maacheba (2009-02-09 13:03) [13]

> sends them to whom?)))

To Google's server, of course. Or more precisely, to the script specified in the "<form ...>" tag (its action attribute) of the HTML document.

← →
brother © (2009-02-09 13:06) [14]

> To Google's server, of course. Or more precisely, to the script

there, you've answered your own question))))) After all, you can't see what it (the script, etc.) does with the request)

← →
brother © (2009-02-09 13:06) [15]

IMHO the standard has gone fuzzy)

← →
Ega23 © (2009-02-09 13:14) [16]

> I'd like an algorithm for decoding a URL, preferably without knowing
> the encoding )

Pick out every % together with the two characters after it. The two characters are a hex representation. Take the Char of it.
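The extraction step described above can be sketched as follows. Note that it produces bytes, not characters: "taking the Char" of each byte is only right for single-byte charsets, and multi-byte encodings such as UTF-8 need a separate decoding step. This is an illustrative Python sketch; the function name percent_decode_to_bytes is hypothetical.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def percent_decode_to_bytes(s: str) -> bytes:
    """Replace every %XY escape with the byte 0xXY; copy other text as-is."""
    out = bytearray()
    i = 0
    while i < len(s):
        is_escape = (
            s[i] == "%"
            and i + 2 < len(s)
            and s[i + 1] in HEX_DIGITS
            and s[i + 2] in HEX_DIGITS
        )
        if is_escape:
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        else:
            # Assumption: a malformed or unescaped character is kept literally.
            out.extend(s[i].encode("utf-8"))
            i += 1
    return bytes(out)


raw = percent_decode_to_bytes("%D1%84")
print(raw)                          # b'\xd1\x84'
print(raw.decode("utf-8"))          # ф
print(len(raw.decode("latin-1")))   # 2 (one character per byte gives two characters)
```

Returning bytes keeps the two concerns apart: unescaping is charset-independent, while turning the bytes into text is not.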

← →
Maacheba (2009-02-09 13:35) [17]

> The two characters are a hex representation. Take the Char of it

That's what I used to think too )
Now go to google.ru or yandex.ru and type, say, the letter "ф" into the search box. You'll see that it is represented as "%D1%84".

← →
Maacheba (2009-02-09 13:47) [18]

And it depends on the page encoding: UTF-8 or ANSI.

But here is what interests me: Google gets it right either way. You can go here:

http://www.google.ru/search?q=%EF%F0%E8%E2%E5%F2

Or here:

http://www.google.ru/search?q=%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82

These are the same query. But how does Google determine this? How does it know that the first case uses win-1251 and the second UTF-8? Can this be deduced from the structure of Unicode itself? From what I'm reading in the standards right now, it cannot. Or does Google do something very clever:

it determines the country from the HTTP header, then assumes the string is ANSI and tries to verify that. It checks whether the characters, decoded as ANSI, belong to the national alphabet (the country, and hence the alphabet, is taken from the HTTP header), and if everything fits, it concludes that the string is ANSI after all. If it does not fit, it interprets the string as Unicode...
I have no other explanation!

Or am I missing something?
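How Google actually does it is not stated anywhere in this thread. One common heuristic, shown here purely as an assumption, is to try strict UTF-8 first and fall back to a single-byte code page when the bytes are not valid UTF-8. It works for these two URLs because Cyrillic letters in win-1251 are bytes 0xC0..0xFF, which UTF-8 treats as lead bytes that must be followed by continuation bytes 0x80..0xBF. The function name and the fallback parameter are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(value: str, fallback: str = "cp1251") -> str:
    """Guess the charset of a form-encoded value: strict UTF-8, else fallback."""
    raw = unquote_to_bytes(value.replace("+", " "))
    try:
        return raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback charset.
        return raw.decode(fallback, errors="replace")


print(decode_query_value("%EF%F0%E8%E2%E5%F2"))                    # привет
print(decode_query_value("%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82"))  # привет
```

This is a guess, not a proof: a byte string can be valid in both encodings, in which case the heuristic silently picks UTF-8.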

← →
antonn © (2009-02-09 13:51) [19]

> Maacheba (09.02.09 13:35) [17]

Yandex can encode the string any way it likes for its own purposes; only Yandex needs it. That is probably how Unicode gets encoded there, but it could just as well be base64.
urlencode and urldecode, however, must follow the standard. I use the functions from Synapse:

type
  TSpecials = set of AnsiChar;

const
  URLSpecialChar: TSpecials =
    [#$00..#$20, '_', '<', '>', '"', '%', '{', '}', '|', '\', '^', '~', '[', ']', '`', #$7F..#$FF];

function EncodeTriplet(const Value: AnsiString; Delimiter: AnsiChar;
  Specials: TSpecials): AnsiString;
var
  n, l: Integer;
  s: AnsiString;
  c: AnsiChar;
begin
  SetLength(Result, Length(Value) * 3);
  l := 1;
  for n := 1 to Length(Value) do
  begin
    c := Value[n];
    if c in Specials then
    begin
      Result[l] := Delimiter;
      Inc(l);
      s := IntToHex(Ord(c), 2);
      Result[l] := s[1];
      Inc(l);
      Result[l] := s[2];
      Inc(l);
    end
    else
    begin
      Result[l] := c;
      Inc(l);
    end;
  end;
  Dec(l);
  SetLength(Result, l);
end;

function EncodeURL(const Value: AnsiString): AnsiString;
begin
  Result := EncodeTriplet(Value, '%', URLSpecialChar);
end;

function DecodeTriplet(const Value: AnsiString; Delimiter: AnsiChar): AnsiString;
var
  x, l, lv: Integer;
  c: AnsiChar;
  b: Byte;
  bad: Boolean;
begin
  lv := Length(Value);
  SetLength(Result, lv);
  x := 1;
  l := 1;
  while x <= lv do
  begin
    c := Value[x];
    Inc(x);
    if c <> Delimiter then
    begin
      Result[l] := c;
      Inc(l);
    end
    else
      if x < lv then
      begin
        Case Value[x] Of
          #13:
            if (Value[x + 1] = #10) then
              Inc(x, 2)
            else
              Inc(x);
          #10:
            if (Value[x + 1] = #13) then
              Inc(x, 2)
            else
              Inc(x);
        else
          begin
            bad := False;
            Case Value[x] Of
              '0'..'9': b := (Byte(Value[x]) - 48) Shl 4;
              'a'..'f', 'A'..'F': b := ((Byte(Value[x]) And 7) + 9) shl 4;
            else
              begin
                b := 0;
                bad := True;
              end;
            end;
            Case Value[x + 1] Of
              '0'..'9': b := b Or (Byte(Value[x + 1]) - 48);
              'a'..'f', 'A'..'F': b := b Or ((Byte(Value[x + 1]) And 7) + 9);
            else
              bad := True;
            end;
            if bad then
            begin
              Result[l] := c;
              Inc(l);
            end
            else
            begin
              Inc(x, 2);
              Result[l] := AnsiChar(b);
              Inc(l);
            end;
          end;
        end;
      end
      else
        break;
  end;
  Dec(l);
  SetLength(Result, l);
end;

function DecodeURL(const Value: AnsiString): AnsiString;
begin
  Result := DecodeTriplet(Value, '%');
end;

← →
Maacheba (2009-02-09 13:54) [20]

By the way, by this logic the JS function escape should also return a result that depends on the encoding declared in the tags of the page where the function is called?!

← →
Ega23 © (2009-02-09 14:02) [21]

Here is what FireBug shows me

Response Headers
Date: Mon, 09 Feb 2009 10:59:53 GMT
Server: Apache/2.2.4 (Win32) mod_fastcgi/2.4.6
Keep-Alive: timeout=5, max=93
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;

Request Headers
Host: localhost
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Method: POST //localhost/fcgi-bin/fcgisrv.exe HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Referer: http://localhost/sline/databasepage.html?SID={D43E33A7-D187-4D45-BD15-E9AFE17B3A9D}
Content-Length: 53

← →
antonn © (2009-02-09 14:12) [22]

> Ega23 © (09.02.09 14:02) [21]

well, why shouldn't it helpfully convert it into a presentable form; hence the braces and the minus signs :)

← →
Anatoly Podgoretsky © (2009-02-09 14:29) [23]

> Maacheba (09.02.2009 11:49:00) [0]

What do you have against Indians?
An old, very intelligent and very disciplined nation.

← →
Anatoly Podgoretsky © (2009-02-09 14:30) [24]

> Maacheba (09.02.2009 11:51:01) [1]

And you said Indians were fools!

← →
Maacheba (2009-02-09 14:43) [25]

> I use the functions from Synapse:

Good for you, but that function is ANSI-oriented. And there is also UTF-8, as I understand it.

> Here is what FireBug shows me

What is this about? Your example shows neither GET nor POST data.

Anyway, I've figured it all out. The confusion comes from the fact that two encodings are in use: sometimes a single-byte ANSI code page, sometimes UTF-8. The browser apparently sends the data in one or the other depending on the page encoding.

I have only one question left: how does Google determine the encoding, so that both of these queries work identically:

http://www.google.ru/search?q=%EF%F0%E8%E2%E5%F2 (win-1251)
and
http://www.google.ru/search?q=%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82 (UTF-8)

← →
Anatoly Podgoretsky © (2009-02-09 15:03) [26]

> Maacheba (09.02.2009 14:43:25) [25]

You understand correctly, but you draw the wrong conclusions.
With knowledge like that nothing good will come of it; in the general case you will get some freak stitched together by several forums.
You don't know the basics of how networks work.

← →
iZEN © (2009-02-09 15:24) [27]

/* * @(#)URLDecoder.java 1.28 05/11/17 * * Copyright 2006 Sun Microsystems, Inc. All rights reserved. * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms. */ package java.net; import java.io.*; /** * Utility class for HTML form decoding. This class contains static methods * for decoding a String from the application/x-www-form-urlencoded * MIME format. * <p> * To conversion process is the reverse of that used by the URLEncoder class. It is assumed * that all characters in the encoded string are one of the following: * "a" through "z", * "A" through "Z", * "0" through "9", and * "-", "_", * ".", and "*". The * character "%" is allowed but is interpreted * as the start of a special escaped sequence. * <p> * The following rules are applied in the conversion: * <p> * <ul> * <li>The alphanumeric characters "a" through * "z", "A" through * "Z" and "0" * through "9" remain the same. * <li>The special characters ".", * "-", "*", and * "_" remain the same. * <li>The plus sign "+" is converted into a * space character " " . * <li>A sequence of the form "%xy" will be * treated as representing a byte where xy is the two-digit * hexadecimal representation of the 8 bits. Then, all substrings * that contain one or more of these byte sequences consecutively * will be replaced by the character(s) whose encoding would result * in those consecutive bytes. * The encoding scheme used to decode these characters may be specified, * or if unspecified, the default encoding of the platform will be used. * </ul> * <p> * There are two possible ways in which this decoder could deal with * illegal strings. It could either leave illegal characters alone or * it could throw an <tt>{@link java.lang.IllegalArgumentException}</tt>. * Which approach the decoder takes is left to the * implementation. 
* * @author Mark Chamness * @author Michael McCloskey * @version 1.28, 11/17/05 * @since 1.2 */ public class URLDecoder { // The platform default encoding static String dfltEncName = URLEncoder.dfltEncName; /** * Decodes a x-www-form-urlencoded string. * The platform"s default encoding is used to determine what characters * are represented by any consecutive sequences of the form * "%xy". * @param s the String to decode * @deprecated The resulting string may vary depending on the platform"s * default encoding. Instead, use the decode(String,String) method * to specify the encoding. * @return the newly decoded String */ @Deprecated public static String decode(String s) { String str = null; try { str = decode(s, dfltEncName); } catch (UnsupportedEncodingException e) { // The system should always have the platform default } return str; } /** * Decodes a application/x-www-form-urlencoded string using a specific * encoding scheme. * The supplied encoding is used to determine * what characters are represented by any consecutive sequences of the * form "%xy". * <p> * <em><strong>Note:</strong> The <a href= * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars"> * World Wide Web Consortium Recommendation</a> states that * UTF-8 should be used. Not doing so may introduce * incompatibilites.</em> * * @param s the String to decode * @param enc The name of a supported * <a href="../lang/package-summary.html#charenc">character * encoding</a>. * @return the newly decoded String * @exception UnsupportedEncodingException * If character encoding needs to be consulted, but * named character encoding is not supported * @see URLEncoder#encode(java.lang.String, java.lang.String) * @since 1.4 */ public static String decode(String s, String enc) throws UnsupportedEncodingException{ boolean needToChange = false; int numChars = s.length(); StringBuffer sb = new StringBuffer(numChars > 500 ? 
numChars / 2 : numChars); int i = 0; if (enc.length() == 0) { throw new UnsupportedEncodingException ("URLDecoder: empty string enc parameter"); } char c; byte[] bytes = null; while (i < numChars) { c = s.charAt(i); switch (c) { case "+": sb.append(" "); i++; needToChange = true; break; case "%": /* * Starting with this instance of %, process all * consecutive substrings of the form %xy. Each * substring %xy will yield a byte. Convert all * consecutive bytes obtained this way to whatever * character(s) they represent in the provided * encoding. */ try { // (numChars-i)/3 is an upper bound for the number // of remaining bytes if (bytes == null) bytes = new byte[(numChars-i)/3]; int pos = 0; while ( ((i+2) < numChars) && (c=="%")) { bytes[pos++] = (byte)Integer.parseInt(s.substring(i+1,i+3),16); i+= 3; if (i < numChars) c = s.charAt(i); } // A trailing, incomplete byte encoding such as // "%x" will cause an exception to be thrown if ((i < numChars) && (c=="%")) throw new IllegalArgumentException( "URLDecoder: Incomplete trailing escape (%) pattern"); sb.append(new String(bytes, 0, pos, enc)); } catch (NumberFormatException e) { throw new IllegalArgumentException( "URLDecoder: Illegal hex characters in escape (%) pattern - " + e.getMessage()); } needToChange = true; break; default: sb.append(c); i++; break; } } return (needToChange? sb.toString() : s); } }

← →
iZEN © (2009-02-09 15:27) [28]

/* * @(#)URLEncoder.java 1.32 06/04/22 * * Copyright 2006 Sun Microsystems, Inc. All rights reserved. * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms. */ package java.net; import java.io.ByteArrayOutputStream; import java.io.BufferedWriter; import java.io.OutputStreamWriter; import java.io.IOException; import java.io.UnsupportedEncodingException; import java.io.CharArrayWriter; import java.nio.charset.Charset; import java.nio.charset.IllegalCharsetNameException; import java.nio.charset.UnsupportedCharsetException ; import java.util.BitSet; import java.security.AccessController; import java.security.PrivilegedAction; import sun.security.action.GetBooleanAction; import sun.security.action.GetPropertyAction; /** * Utility class for HTML form encoding. This class contains static methods * for converting a String to the application/x-www-form-urlencoded MIME * format. For more information about HTML form encoding, consult the HTML * <A HREF="http://www.w3.org/TR/html4/">specification</A>. * * <p> * When encoding a String, the following rules apply: * * <p> * <ul> * <li>The alphanumeric characters "a" through * "z", "A" through * "Z" and "0" * through "9" remain the same. * <li>The special characters ".", * "-", "*", and * "_" remain the same. * <li>The space character " " is * converted into a plus sign "+". * <li>All other characters are unsafe and are first converted into * one or more bytes using some encoding scheme. Then each byte is * represented by the 3-character string * "%xy", where xy is the * two-digit hexadecimal representation of the byte. * The recommended encoding scheme to use is UTF-8. However, * for compatibility reasons, if an encoding is not specified, * then the default encoding of the platform is used. 
* </ul> * * <p> * For example using UTF-8 as the encoding scheme the string "The * string ü@foo-bar" would get converted to * "The+string+%C3%BC%40foo-bar" because in UTF-8 the character * ü is encoded as two bytes C3 (hex) and BC (hex), and the * character @ is encoded as one byte 40 (hex). * * @author Herb Jellinek * @version 1.32, 04/22/06 * @since JDK1.0 */ public class URLEncoder { static BitSet dontNeedEncoding; static final int caseDiff = ("a" - "A"); static String dfltEncName = null; static { /* The list of characters that are not encoded has been * determined as follows: * * RFC 2396 states: * ----- * Data characters that are allowed in a URI but do not have a * reserved purpose are called unreserved. These include upper * and lower case letters, decimal digits, and a limited set of * punctuation marks and symbols. * * unreserved = alphanum | mark * * mark = "-" | "_" | "." | "!" | "~" | "*" | """ | "(" | ")" * * Unreserved characters can be escaped without changing the * semantics of the URI, but this should not be done unless the * URI is being used in a context that does not allow the * unescaped character to appear. * ----- * * It appears that both Netscape and Internet Explorer escape * all special characters from this list with the exception * of "-", "_", ".", "*". While it is not clear why they are * escaping the other characters, perhaps it is safest to * assume that there might be contexts in which the others * are unsafe if not escaped. Therefore, we will use the same * list. It is also noteworthy that this is consistent with * O"Reilly"s "HTML: The Definitive Guide" (page 164). * * As a last note, Intenet Explorer does not encode the "@" * character which is clearly not unreserved according to the * RFC. We are being consistent with the RFC in this matter, * as is Netscape. 
* */ dontNeedEncoding = new BitSet(256); int i; for (i = "a"; i <= "z"; i++) { dontNeedEncoding.set(i); } for (i = "A"; i <= "Z"; i++) { dontNeedEncoding.set(i); } for (i = "0"; i <= "9"; i++) { dontNeedEncoding.set(i); } dontNeedEncoding.set(" "); /* encoding a space to a + is done * in the encode() method */ dontNeedEncoding.set("-"); dontNeedEncoding.set("_"); dontNeedEncoding.set("."); dontNeedEncoding.set("*"); dfltEncName = (String)AccessController.doPrivileged ( new GetPropertyAction("file.encoding") ); } /** * You can"t call the constructor. */ private URLEncoder() { } /** * Translates a string into x-www-form-urlencoded * format. This method uses the platform"s default encoding * as the encoding scheme to obtain the bytes for unsafe characters. * * @param s String to be translated. * @deprecated The resulting string may vary depending on the platform"s * default encoding. Instead, use the encode(String,String) * method to specify the encoding. * @return the translated String. */ @Deprecated public static String encode(String s) { String str = null; try { str = encode(s, dfltEncName); } catch (UnsupportedEncodingException e) { // The system should always have the platform default } return str; }
--->

← →
iZEN © (2009-02-09 15:28) [29]

<---
/** * Translates a string into application/x-www-form-urlencoded * format using a specific encoding scheme. This method uses the * supplied encoding scheme to obtain the bytes for unsafe * characters. * <p> * <em><strong>Note:</strong> The <a href= * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars"> * World Wide Web Consortium Recommendation</a> states that * UTF-8 should be used. Not doing so may introduce * incompatibilites.</em> * * @param s String to be translated. * @param enc The name of a supported * <a href="../lang/package-summary.html#charenc">character * encoding</a>. * @return the translated String. * @exception UnsupportedEncodingException * If the named encoding is not supported * @see URLDecoder#decode(java.lang.String, java.lang.String) * @since 1.4 */ public static String encode(String s, String enc) throws UnsupportedEncodingException { boolean needToChange = false; StringBuffer out = new StringBuffer(s.length()); Charset charset; CharArrayWriter charArrayWriter = new CharArrayWriter(); if (enc == null) throw new NullPointerException("charsetName"); try { charset = Charset.forName(enc); } catch (IllegalCharsetNameException e) { throw new UnsupportedEncodingException(enc); } catch (UnsupportedCharsetException e) { throw new UnsupportedEncodingException(enc); } for (int i = 0; i < s.length();) { int c = (int) s.charAt(i); //System.out.println("Examining character: " + c); if (dontNeedEncoding.get(c)) { if (c == " ") { c = "+"; needToChange = true; } //System.out.println("Storing: " + c); out.append((char)c); i++; } else { // convert to external encoding before hex conversion do { charArrayWriter.write(c); /* * If this character represents the start of a Unicode * surrogate pair, then pass in two characters. It"s not * clear what should be done if a bytes reserved in the * surrogate pairs range occurs outside of a legal * surrogate pair. For now, just treat it as if it were * any other character. 
*/ if (c >= 0xD800 && c <= 0xDBFF) { /* System.out.println(Integer.toHexString(c) + " is high surrogate"); */ if ( (i+1) < s.length()) { int d = (int) s.charAt(i+1); /* System.out.println("\tExamining " + Integer.toHexString(d)); */ if (d >= 0xDC00 && d <= 0xDFFF) { /* System.out.println("\t" + Integer.toHexString(d) + " is low surrogate"); */ charArrayWriter.write(d); i++; } } } i++; } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i)))); charArrayWriter.flush(); String str = new String(charArrayWriter.toCharArray()); byte[] ba = str.getBytes(charset); for (int j = 0; j < ba.length; j++) { out.append("%"); char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16); // converting to use uppercase letter as part of // the hex value if ch is a letter. if (Character.isLetter(ch)) { ch -= caseDiff; } out.append(ch); ch = Character.forDigit(ba[j] & 0xF, 16); if (Character.isLetter(ch)) { ch -= caseDiff; } out.append(ch); } charArrayWriter.reset(); needToChange = true; } } return (needToChange? out.toString() : s); } }

← →
Тыдыщ (2009-02-09 15:31) [30]

Aaaah. Spammers!!! Aaaah!!!! iZEN is a spammer!!!

← →
Maacheba (2009-02-09 15:31) [31]

iZEN, if I understood this Java code correctly, it does only one thing:

if it encounters a percent sign, the next two characters are interpreted as a byte. Roughly speaking, in Delphi terms: '%20' = Char($20) = ' '; // space

and that is the only thing this code does. Right?

> and that is the only thing this code does. Right?

It can't be. There is too much of it for that.

← →
Maacheba (2009-02-09 16:54) [33]

> It can't be. There is too much of it for that.

I think so too ))) But I don't understand Java well... So this latest copy-paste doesn't help much.

Still, it looks like this code doesn't answer the question of how Google recognizes the encoding anyway. My guess is that it uses some internal probabilistic algorithms of its own. Or is it possible after all to tell a WIDE string from an ANSI one with 100% certainty? I think even Richter mentioned that Windows has a built-in function that does this, but only with some probability. So in the general case it seems you can't tell them apart.

Google apparently makes an estimate based on the string and the country reported by the browser.

> Maacheba (09.02.09 16:54) [33]
>
>
> > It can't be. There is too much of it for that.
>
> I think so too ))) But I don't understand Java well... So this
> latest copy-paste doesn't help much.
>
> Still, it looks like this code doesn't answer the question
> of how Google recognizes the encoding anyway. My guess is
> that it uses some internal probabilistic algorithms of its own.
> Or is it possible after all to tell a WIDE string from an
> ANSI one with 100% certainty? I think even Richter mentioned that
> Windows has a built-in function that does this, but only with
> some probability. So in the general case it seems you can't
> tell them apart.
>

In UTF-8, English characters are transmitted as single bytes. National-alphabet characters in URLs are still at the draft stage. :))

Ah! No. National encodings are supported like this:

For example using UTF-8 as the encoding scheme the string "The string ü@foo-bar" would get converted to "The+string+%C3%BC%40foo-bar" because in UTF-8 the character ü is encoded as two bytes C3 (hex) and BC (hex), and the character @ is encoded as one byte 40 (hex).
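The Javadoc example can be reproduced outside Java. A short sketch with Python's urllib.parse, whose quote_plus follows the same form-encoding idea (the two libraries differ on a few punctuation characters such as "*" and "~", so this is an illustration rather than a drop-in equivalent of java.net.URLEncoder):

```python
from urllib.parse import quote_plus, unquote_plus

original = "The string ü@foo-bar"

encoded = quote_plus(original, encoding="utf-8")
print(encoded)  # The+string+%C3%BC%40foo-bar

# Round trip: decoding with the same charset restores the text.
decoded = unquote_plus(encoded, encoding="utf-8")
print(decoded == original)  # True
```

The character is first converted to bytes in the chosen charset, and only then is each byte written as %XY, which is the part the simple one-character-per-escape decoders miss.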

← →
Maacheba (2009-02-10 12:36) [36]

> In UTF-8, English characters are transmitted as single bytes

and?

> National-alphabet characters in URLs are still at the draft stage.
> :))

Seriously? Then how, in your opinion, do search engines pass Russian search text in GET requests? )))))) That's quite a statement...

Anyway, I get it: you copy-pasted it all right, but you never looked inside...

I've basically done everything already; now I'm just curious how Google determines the encoding. More precisely: is there a 100% reliable way to determine it (I don't see one), or does Google work from probabilities?

> I've basically done everything already; now I'm just curious how
> Google determines the encoding. More precisely: is there a 100% reliable
> way to determine it (I don't see one), or does Google work
> from probabilities.

But the browser itself sends the server a URL-encoded string. All the re-encoding happens on the client side.

← →
Maacheba (2009-02-10 23:09) [38]

iZEN © (10.02.09 18:42) [37]
But the browser itself sends the server a URL-encoded string

I understand that reading other people's posts is boring and writing them is much more fun. But I'll repeat the links:

http://www.google.ru/search?q=%EF%F0%E8%E2%E5%F2

http://www.google.ru/search?q=%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82

These are so-called GET requests. In the first case the text is encoded in win-1251, and in the second in UTF-8.
Yet Google recognizes both formats correctly (and identically), regardless of the encoding.

← →
Maacheba (2009-02-10 23:10) [39]

Perhaps you think the browser itself re-encodes GET requests into the required encoding. But:

1) how would the browser know which encoding the URL is written in?

2) that is not the case, as this very site shows.

This works:

http://www.delphimaster.ru/cgi-bin/forum.pl?n=3&search=%EF%F0%E8%E2%E5%F2

And this does not:

http://www.delphimaster.ru/cgi-bin/forum.pl?n=3&search=%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82

even though the same word, "привет", is encoded in both, just in different encodings. Google understands either one; DM understands only the first, because all its pages are served in win-1251 and it apparently assumes only that encoding.

Is it clear now?

> Maacheba (10.02.09 23:10) [39]

You can start here: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
It takes 2 minutes to google, by the way :)

← →
Maacheba (2009-02-10 23:40) [41]

Zeqfreed, all people are alike, but they google differently )))
So, as I suspected, all the estimates are statistical. There is no exact, 100% reliable way to recognize the encoding. That is what I was curious about.
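A small Python illustration of why no exact method can exist: the UTF-8 bytes of "привет" are also a perfectly valid win-1251 byte string, so nothing in the bytes themselves rules out the wrong reading. The variable names are illustrative only.

```python
utf8_bytes = "привет".encode("utf-8")

# Decoding as win-1251 raises no error; it just yields different text.
misread = utf8_bytes.decode("cp1251")

print(misread != "привет")                      # True
print(misread.encode("cp1251") == utf8_bytes)   # True (same bytes, two readings)
```

A detector can only judge which reading looks more like real text, which is a statistical decision.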

> Maacheba (09.02.09 11:49)
> Was Indy written by Indians?
Java was written on Java, and Delphi by dolphins)..
