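Background
I have recently been building an RSS reader for my own use. One step fetches a JSON file containing the list of RSS sources from the server, then downloads and parses the RSS content according to that file. The core code is: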
class PresenterImpl(val context: Context, val activity: MainActivity) : IPresenter {
private val URL_API = "https://vimerzhao.github.io/others/rssreader/RSS.json"
override fun getRssResource(): RssSource {
val gson = GsonBuilder().create()
return gson.fromJson(getFromNet(URL_API), RssSource::class.java)
}
private fun getFromNet(url: String): String {
val result = URL(url).readText()
return result
}
......
}
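This had always worked well, until two days ago, when I bought the domain vimerzhao.top and redirected the original domain vimerzhao.github.io to it. The tool then stopped working, yet entering URL_API in a browser still returned the data:
So why did URL.readText() fail to get the data?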
No redirect support
You can test this with the following code:
import java.net.*;
import java.io.*;
public class TestRedirect {
public static void main(String args[]) {
try {
URL url1 = new URL("https://vimerzhao.github.io/others/rssreader/RSS.json");
URL url2 = new URL("http://vimerzhao.top/others/rssreader/RSS.json");
read(url1);
System.out.println("=--------------------------------=");
read(url2);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void read(URL url) {
try {
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
}
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
The result is:
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>
=--------------------------------=
{"theme":"tech","author":"zhaoyu","email":"dutzhaoyu@gmail.com","version":"0.01","contents":[{"category":"綜合版塊","websites":[{"tag":"門戶網(wǎng)站","url":["http://geek.csdn.net/admin/news_service/rss","http://blog.jobbole.com/feed/","http://feed.cnblogs.com/blog/sitehome/rss","https://segmentfault.com/feeds","http://www.codeceo.com/article/category/pick/feed"]},{"tag":"知名社區(qū)","url":["https://stackoverflow.com/feeds","https://www.v2ex.com/index.xml"]},{"tag":"官方博客","url":["https://www.blog.google/rss/","https://blog.jetbrains.com/feed/"]},{"tag":"個人博客-行業(yè)","url":["http://feed.williamlong.info/","https://www.liaoxuefeng.com/feed/articles"]},{"tag":"個人博客-學(xué)術(shù)","url":["http://www.norvig.com/rss-feed.xml"]}]},{"category":"編程語言","websites":[{"tag":"Kotlin","url":["https://kotliner.cn/api/rss/latest"]},{"tag":"Python","url":["https://www.python.org/dev/peps/peps.rss/"]},{"tag":"Java","url":["http://www.codeceo.com/article/category/develop/java/feed"]}]},{"category":"行業(yè)動態(tài)","websites":[{"tag":"Android","url":["http://www.codeceo.com/article/category/develop/android/feed"]}]},{"category":"亂七八遭","websites":[{"tag":"Linux-綜合","url":["https://linux.cn/rss.xml","http://www.linuxidc.com/rssFeed.aspx","http://www.codeceo.com/article/tag/linux/feed"]},{"tag":"Linux-發(fā)行版","url":["https://blog.linuxmint.com/?feed=rss2","https://manjaro.github.io/feed.xml"]}]}]}
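The HTTP status code is 301, meaning a redirect occurred. In a browser this happens too quickly for us to notice the 301 page. Note that URL.readText() is a Kotlin extension function which ultimately just calls the openStream method of the URL class. Part of its source: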
.....
/**
* Reads the entire content of this URL as a String using UTF-8 or the specified [charset].
*
* This method is not recommended on huge files.
*
* @param charset a character set to use.
* @return a string with this URL entire content.
*/
@kotlin.internal.InlineOnly
public inline fun URL.readText(charset: Charset = Charsets.UTF_8): String = readBytes().toString(charset)
/**
* Reads the entire content of the URL as byte array.
*
* This method is not recommended on huge files.
*
* @return a byte array with this URL entire content.
*/
public fun URL.readBytes(): ByteArray = openStream().use { it.readBytes() }
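So the test code above explains why URL.readText() failed.
Is it reasonable that URL does not follow this redirect, and why not? That remains to be explored in full, but one relevant fact is that HttpURLConnection, which backs openStream() for http and https URLs, follows redirects by default only when the protocol stays the same. The redirect here appears to go from https to http, so it is not followed (see the Stack Overflow question "URLConnection Doesn't Follow Redirect" in the references).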
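One way to still get the data despite the 301 is to read the Location header and follow it by hand. The following is only a minimal sketch under stated assumptions, not the code the reader tool uses: the class name RedirectFetcher, its method names, and the limit of 5 hops are all made up for illustration. It relies on HttpURLConnection.setInstanceFollowRedirects(false) so that every redirect, including one that changes protocol, is handled in the loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class RedirectFetcher {
    // Hypothetical cap on the number of hops, to avoid redirect loops.
    static final int MAX_REDIRECTS = 5;

    // True for the status codes that carry a Location header to follow.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a Location value (absolute or relative) against the current address.
    static String resolveLocation(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    // Reads the body as UTF-8 text, following redirects by hand (https -> http included).
    static String fetch(String address) throws IOException {
        String current = address;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // handle every redirect ourselves
            int status = conn.getResponseCode();
            if (isRedirect(status)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without a Location header: " + current);
                }
                current = resolveLocation(current, location);
                continue;
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }
        throw new IOException("Too many redirects starting from " + address);
    }
}
```

With something like this in place, a call such as RedirectFetcher.fetch(URL_API) could stand in for URL(url).readText(). Note that silently following an https to http redirect downgrades the connection, which may be exactly why the JDK refuses to do it on its own.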
The unstable equals method
First, the documentation of equals (URL (Java Platform SE 7)):
Compares this URL for equality with another object.
If the given object is not a URL then this method immediately returns false.
Two URL objects are equal if they have the same protocol, reference equivalent hosts, have the same port number on the host, and the same file and fragment of the file.
Two hosts are considered equivalent if both host names can be resolved into the same IP addresses; else if either host name can't be resolved, the host names must be equal without regard to case; or both host names equal to null.
Since hosts comparison requires name resolution, this operation is a blocking operation.
Note: The defined behavior for equals is known to be inconsistent with virtual hosting in HTTP.
Next, consider this code:
import java.net.*;
public class TestEquals {
public static void main(String args[]) {
try {
// vimerzhao's blog home page
URL url1 = new URL("https://vimerzhao.github.io/");
// zhanglanqing's blog home page
URL url2 = new URL("https://zhanglanqing.github.io/");
// the domain that vimerzhao's blog home page now redirects to
URL url3 = new URL("http://vimerzhao.top/");
System.out.println(url1.equals(url2));
System.out.println(url1.equals(url3));
} catch (Exception e) {
e.printStackTrace();
}
}
}
Going by the definition, what should the output be? Running it gives:
true
false
You may have guessed right, but if I disconnect from the network and run it again, the result is:
false
false
Yet all three domains actually resolve to the same IP address, as ping shows:
zhaoyu@Inspiron ~/Project $ ping vimezhao.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 396.692/396.692/396.692/0.000 ms
zhaoyu@Inspiron ~/Project $ ping zhanglanqing.github.io
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=396 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
rtt min/avg/max/mdev = 396.009/396.009/396.009/0.000 ms
zhaoyu@Inspiron ~/Project $ ping vimezhao.top
ping: unknown host vimezhao.top
zhaoyu@Inspiron ~/Project $ ping vimerzhao.top
PING sni.github.map.fastly.net (151.101.77.147) 56(84) bytes of data.
64 bytes from 151.101.77.147: icmp_seq=1 ttl=44 time=409 ms
^C
--- sni.github.map.fastly.net ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 409.978/409.978/409.978/0.000 ms
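First, the case with a network connection. vimerzhao.github.io and zhanglanqing.github.io are my blog and a classmate's blog. Their content differs, but they resolve to the same IP address, and the protocol, port, and so on are the same, so they compare equal. vimerzhao.github.io and vimerzhao.top point to the same blog, but one uses https and the other http; the protocols differ, so they compare unequal. I suspect this runs against most people's intuition: URLs pointing to different blogs are equal, while URLs pointing to the same blog are not!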
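The url1 versus url3 result can be reproduced without any DNS lookup, since it is the protocol (and with it the default port) that differs. Here is a minimal sketch of that check; the class ComponentCheck and its helper names are hypothetical, and only URL.getProtocol(), URL.getPort(), and URL.getDefaultPort() come from the JDK.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ComponentCheck {
    // Parses an address; constructing a URL does not perform a DNS lookup.
    static URL parse(String address) {
        try {
            return new URL(address);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Case-insensitive protocol comparison.
    static boolean sameProtocol(String a, String b) {
        return parse(a).getProtocol().equalsIgnoreCase(parse(b).getProtocol());
    }

    // Explicit port if present, otherwise the protocol's default port.
    static int effectivePort(String address) {
        URL url = parse(address);
        return url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
    }

    public static void main(String[] args) {
        String a = "https://vimerzhao.github.io/";
        String b = "http://vimerzhao.top/";
        System.out.println(sameProtocol(a, b)); // false
        System.out.println(effectivePort(a));   // 443
        System.out.println(effectivePort(b));   // 80
    }
}
```

Because these two components already differ, the comparison is decided before host names matter, which is consistent with url1.equals(url3) being false both online and offline.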
Now for the result without a network connection. First, the source of URL:
public boolean equals(Object obj) {
if (!(obj instanceof URL))
return false;
URL u2 = (URL)obj;
return handler.equals(this, u2);
}
Then the source of the handler object (a URLStreamHandler):
protected boolean equals(URL u1, URL u2) {
String ref1 = u1.getRef();
String ref2 = u2.getRef();
return (ref1 == ref2 || (ref1 != null && ref1.equals(ref2))) &&
sameFile(u1, u2);
}
The source of sameFile:
protected boolean sameFile(URL u1, URL u2) {
// Compare the protocols.
if (!((u1.getProtocol() == u2.getProtocol()) ||
(u1.getProtocol() != null &&
u1.getProtocol().equalsIgnoreCase(u2.getProtocol()))))
return false;
// Compare the files.
if (!(u1.getFile() == u2.getFile() ||
(u1.getFile() != null && u1.getFile().equals(u2.getFile()))))
return false;
// Compare the ports.
int port1, port2;
port1 = (u1.getPort() != -1) ? u1.getPort() : u1.handler.getDefaultPort();
port2 = (u2.getPort() != -1) ? u2.getPort() : u2.handler.getDefaultPort();
if (port1 != port2)
return false;
// Compare the hosts.
if (!hostsEqual(u1, u2))
return false;// this line is reached when there is no network connection
return true;
}
Finally, the source of hostsEqual:
protected boolean hostsEqual(URL u1, URL u2) {
InetAddress a1 = getHostAddress(u1);
InetAddress a2 = getHostAddress(u2);
// if we have internet address for both, compare them
if (a1 != null && a2 != null) {
return a1.equals(a2);
// else, if both have host names, compare them
} else if (u1.getHost() != null && u2.getHost() != null)
return u1.getHost().equalsIgnoreCase(u2.getHost());
else
return u1.getHost() == null && u2.getHost() == null;
}
With a network connection, a1 and a2 are both non-null, so return a1.equals(a2) runs and returns true. Without one, the second branch, return u1.getHost().equalsIgnoreCase(u2.getHost());, runs instead. The host of url1 (vimerzhao.github.io) and the host of url2 (zhanglanqing.github.io) clearly differ, so it returns false, which makes if (!hostsEqual(u1, u2)) true and return false execute.
So the equals method of URL is not only counterintuitive but also inconsistent: it gives different results in different environments, which is very dangerous!
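For comparison, java.net.URI decides equality from the parsed components alone and never resolves host names, so its answer does not depend on the network. A minimal sketch, where UriEquality and sameUri are made-up names for illustration:

```java
import java.net.URI;

class UriEquality {
    // Compares two addresses as URIs: component by component, no name resolution.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        // Different hosts are never equal, online or offline.
        System.out.println(sameUri("https://vimerzhao.github.io/",
                "https://zhanglanqing.github.io/")); // false
        // Host names are compared without regard to case.
        System.out.println(sameUri("https://VIMERZHAO.github.io/",
                "https://vimerzhao.github.io/"));    // true
    }
}
```

This does not make the two blogs "the same" in any deeper sense; it only makes the comparison deterministic.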
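The slow equals method
In addition, equals is an expensive operation, because with a network connection it has to perform DNS resolution. The same goes for hashCode(), which is used here as the example. The source of hashCode() in URL: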
public synchronized int hashCode() {
if (hashCode != -1)
return hashCode;
hashCode = handler.hashCode(this);
return hashCode;
}
The hashCode() method of the handler object:
protected int hashCode(URL u) {
int h = 0;
// Generate the protocol part.
String protocol = u.getProtocol();
if (protocol != null)
h += protocol.hashCode();
// Generate the host part.
InetAddress addr = getHostAddress(u);
if (addr != null) {
h += addr.hashCode();
} else {
String host = u.getHost();
if (host != null)
h += host.toLowerCase().hashCode();
}
// Generate the file part.
String file = u.getFile();
if (file != null)
h += file.hashCode();
// Generate the port part.
if (u.getPort() == -1)
h += getDefaultPort();
else
h += u.getPort();
// Generate the ref part.
String ref = u.getRef();
if (ref != null)
h += ref.hashCode();
return h;
}
Here getHostAddress() takes a lot of time. So storing URL objects in a hash-based container is simply a disaster. The code below compares how URL and URI perform when each is added 50 times:
import java.net.*;
import java.util.*;
public class TestHash {
public static void main(String args[]) {
HashSet<URL> list1 = new HashSet<>();
HashSet<URI> list2 = new HashSet<>();
try {
URL url1 = new URL("https://vimerzhao.github.io/");
URI url2 = new URI("https://zhanglanqing.github.io/");
long cur = System.currentTimeMillis();
int cnt = 50;
for (int i = 0; i < cnt; i++) {
list1.add(url1);
}
System.out.println(System.currentTimeMillis() - cur);
cur = System.currentTimeMillis();
for (int i = 0; i < cnt; i++) {
list2.add(url2);
}
System.out.println(System.currentTimeMillis() - cur);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Output (in milliseconds):
271
0
So it is best not to use URL in hash-based containers.
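A possible pattern, sketched here under assumptions, is to key the container on URI and convert to URL only when a connection is actually opened. The class FeedSet and the method dedupe are hypothetical names; the feed addresses are taken from the RSS.json shown earlier.

```java
import java.net.URI;
import java.net.URL;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class FeedSet {
    // Collects feed addresses into a hash-based set keyed by URI.
    // URI.hashCode() and URI.equals() only look at the parsed components.
    static Set<URI> dedupe(List<String> addresses) {
        Set<URI> unique = new LinkedHashSet<>();
        for (String address : addresses) {
            unique.add(URI.create(address));
        }
        return unique;
    }

    public static void main(String[] args) throws Exception {
        Set<URI> feeds = dedupe(Arrays.asList(
                "https://linux.cn/rss.xml",
                "https://linux.cn/rss.xml",
                "https://www.v2ex.com/index.xml"));
        System.out.println(feeds.size()); // 2
        // Convert to URL only at the point where a connection would be opened.
        for (URI feed : feeds) {
            URL url = feed.toURL();
            System.out.println(url.getHost());
        }
    }
}
```

LinkedHashSet is used only to keep the insertion order readable; a plain HashSet would behave the same with respect to hashing.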
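The role of the trailing slash
The so-called trailing slash is the slash at the end of the domain name. For example, the browser shows vimerzhao.top, but copying and pasting it yields http://vimerzhao.top/. First, test with the following code: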
import java.net.*;
import java.io.*;
public class TestTrailingSlash {
public static void main(String args[]) {
try {
URL url1 = new URL("https://vimerzhao.github.io/");
URL url2 = new URL("https://vimerzhao.github.io");
System.out.println(url1.equals(url2));
outputInfo(url1);
outputInfo(url2);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void outputInfo(URL url) {
System.out.println("------" + url.toString() + "----------");
System.out.println(url.getRef());
System.out.println(url.getFile());
System.out.println(url.getHost());
System.out.println("----------------");
}
}
The result is:
false
------https://vimerzhao.github.io/----------
null
/
vimerzhao.github.io
----------------
------https://vimerzhao.github.io----------
null
vimerzhao.github.io
----------------
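In fact, whether read with the read() method from earlier or entered directly in the address bar, url1 and url2 return the same content. But a trailing / indicates a directory, while its absence indicates a file, so getFile() returns different results for the two (/ for url1 and an empty string for url2), which makes equals return false. When typing in the address bar you would not even notice the trailing slash, and the returned result is the same, yet equals returns false. It is really hard to guard against!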
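If the two forms of the root address should compare equal, one option is to normalize an empty path to / before comparing, which matches how RFC 3986 treats an empty path for http-style URIs. The sketch below is illustrative only: SlashNormalizer and withRootSlash are made-up names, and it deliberately leaves non-root paths such as /tags alone, because whether /tags and /tags/ name the same resource is up to the server.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SlashNormalizer {
    // Returns the address as a URI, with an empty path replaced by "/".
    // Only the root is touched: "/tags" and "/tags/" stay distinct.
    static URI withRootSlash(String address) {
        URI uri = URI.create(address);
        if (uri.getAuthority() == null || !uri.getPath().isEmpty()) {
            return uri; // no host part, or the path is already non-empty
        }
        try {
            return new URI(uri.getScheme(), uri.getAuthority(), "/",
                    uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI a = withRootSlash("https://vimerzhao.github.io/");
        URI b = withRootSlash("https://vimerzhao.github.io");
        System.out.println(b);           // https://vimerzhao.github.io/
        System.out.println(a.equals(b)); // true
    }
}
```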
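This raises another question: one is a file and the other a directory, so why do both return the same result?
After some investigation I found that when the request ends with /, the server looks for the index.html file in that directory. When it does not, taking vimerzhao.top/tags as an example, the server first looks for tags; if that is not found, a / is automatically appended and the server then looks for index.html in the tags directory. See the figure: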
Here is an interesting test. Write the following two programs:
import java.net.*;
import java.io.*;
public class TestTrailingSlash {
public static void main(String args[]) {
try {
URL urlWithSlash = new URL("http://vimerzhao.top/tags/");
int cnt = 5;
long cur = System.currentTimeMillis();
for (int i = 0; i < cnt; i++) {
read(urlWithSlash);
}
System.out.println(System.currentTimeMillis() - cur);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void read(URL url) {
try {
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
//System.out.println(inputLine);
}
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
import java.net.*;
import java.io.*;
public class TestWithoutTrailingSlash {
public static void main(String args[]) {
try {
URL urlWithoutSlash = new URL("http://vimerzhao.top/tags");
int cnt = 5;
long cur = System.currentTimeMillis();
for (int i = 0; i < cnt; i++) {
read(urlWithoutSlash);
}
System.out.println(System.currentTimeMillis() - cur);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void read(URL url) {
try {
BufferedReader in = new BufferedReader(
new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
//System.out.println(inputLine);
}
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Test with the following script:
#!/bin/bash
# {1..20} is a bash feature; >> appends so that all 20 timings are kept
for i in {1..20}; do
java TestTrailingSlash >> out1
java TestWithoutTrailingSlash >> out2
done
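Tabulating the measured times:
Adding the / turns out to be faster, because it saves the step of checking whether a tags file exists. This also gives us a takeaway: it is best to add the / at the end of a URL!
That is all: a few pitfalls I discovered this weekend.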
References
- Official Google Webmaster Central Blog: To slash or not to slash
- url rewriting - When should I use a trailing slash in my URL? - Stack Overflow
- What Does a Slash at the End of a Website's URL Mean?
- Mr. Gosling - why did you make URL equals suck?!? - Invert Your Mind
- java - URLConnection Doesn't Follow Redirect - Stack Overflow
- java - Proper way to check for URL equality - Stack Overflow
- http - How to compare two URLs in java? - Stack Overflow