Thursday, January 17, 2013

You can run managed server without admin server

Just to clarify: technically you can run a WebLogic managed server without the admin server, as long as nothing on the managed server needs to reach the admin server. That said, I have never seen anyone do this in a production environment.

e.g. You can start the admin server, then start the OSB server, and create a proxy, call it echoPS, which just echoes back your input. Then shut down the admin server. Test echoPS from soapUI; it runs just fine. You can also simply start the OSB server without the admin server ever running: you'll see a bunch of errors, but the managed server will eventually start, and if you test echoPS, it works just fine.
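For reference, the "start without admin" case boils down to something like the commands below. The domain path, host, and port are hypothetical; adjust for your install. What makes this work is WebLogic's Managed Server Independence (MSI) mode: when the admin server is unreachable, the managed server falls back to its locally cached copy of the domain config.

```shell
# Hypothetical domain home -- adjust for your environment.
DOMAIN_HOME=/u01/oracle/user_projects/domains/osb_domain

# Same command whether the admin server is up or down:
#   arg 1 = managed server name, arg 2 = admin server URL.
# With the admin server down, you'll see connection errors in the output,
# but the server falls back to MSI mode and eventually reaches RUNNING.
$DOMAIN_HOME/bin/startManagedWebLogic.sh osb_server1 http://adminhost:7001
```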

Friday, January 4, 2013

Rolling soa_server.out log file

The SOA server keeps many logs. They are mostly WLS logs, such as soa_server1.log, osb_server1.log, etc. All of these files can be configured to roll automatically. Except soa_server1.out.

Strictly speaking, soa_server1.out is not really a WLS "log" file. If you start the SOA server from a command window, by default standard output spits all the messages into the command window. When you start the SOA server via node manager, standard output gets redirected to soa_server1.out. Since standard output is not really a log file, you cannot configure WLS to roll it. Over time, soa_server1.out can grow very big and cause problems.

I'm sure many people have their solutions to deal with the issue. Here is a cheat I figured out that can achieve the effect of "rolling" soa_server1.out.

Write a simple script:

  cp soa_server1.out soa_server1.out_$(date +%b)_$(date +%d)_$(date +%Y)
  echo > soa_server1.out

The first command backs up the file; the second truncates it but keeps the same file in place, so SOA happily keeps dumping data into the same "log" file. Use cron to schedule the script to run, and that should be it.
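For example, a nightly crontab entry could look like this (the script name and path are my own invention; substitute wherever you saved the two-line script):

```shell
# crontab -e entry: roll soa_server1.out at 02:00 every night.
# /path/to/roll_soa_out.sh is the backup-and-truncate script above.
0 2 * * * /path/to/roll_soa_out.sh >/dev/null 2>&1
```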

One reminder: I used cp in the first step, then truncated the file in the second. I also tried, in the first step, renaming the soa_server1.out file and then running "touch soa_server1.out" to create a new log file. It doesn't work that way. The SOA server keeps its open file handle on the original file, so it doesn't care that you renamed it. Even if you create a new soa_server1.out, it won't write to the new file.
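The rename failure is generic Unix behavior, not SOA-specific: a process writes through its open file descriptor, which follows the inode, not the name. Here is a small self-contained demo (temp files only, nothing to do with WebLogic) showing that after a rename the writer keeps filling the old file while the newly created one stays empty:

```shell
# A background "server" appends a line every 0.1s, like stdout redirected
# by node manager. We rename its output file mid-run and create a fresh,
# empty file under the old name.
tmpdir=$(mktemp -d)
cd "$tmpdir"

( i=0; while [ $i -lt 20 ]; do echo "line $i"; i=$((i+1)); sleep 0.1; done ) >> app.out &
writer=$!

sleep 0.5
mv app.out app.out.bak   # rename: the writer's fd follows the inode
touch app.out            # new, empty file with the old name
wait $writer

# All 20 lines land in the renamed file; the new app.out stays at 0 bytes.
# That is exactly why "mv + touch" fails while "echo > file" (truncate in
# place, same inode) succeeds.
echo "renamed: $(wc -c < app.out.bak) bytes, new file: $(wc -c < app.out) bytes"
```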

OSB cluster synching problem

We ran into an OSB cluster synching problem in our production environment. The key words are "production environment"; otherwise it wouldn't be an issue, because you could always restart both nodes and they would sync up without problem. We needed a solution that didn't involve shutting down the cluster.

The problem happened after we updated an OSB project containing a collection of proxies, business services, etc. To be exact, we used "ant", which in turn uses WLST to run the deployment, but I don't think that's relevant.

We have two OSB nodes. After the deployment, we started to see exceptions in the logs of node1 (osb_server1):

<Error> <ConfigFwk> <BEA-000000> <notifyAfterEnd() failed for listener ResourceListenerNotifier
java.lang.NoClassDefFoundError: com/bea/alsb/coherence/impl/RemoveProcessor
at com.bea.alsb.coherence.impl.CoherenceCache.removeAll(

<Error> <OSB Kernel> <BEA-382016> <Failed to instantiate router for service ProxyService CIS/proxy/GetAutopayWithdrawOptionByAccountID: The configurations for the proxy service is in flux. The configurations for the proxy service is in flux.

We had to shut down node1; otherwise it caused a nasty service interruption. We logged a Sev-1 SR with Oracle. The response we got from support borders on irresponsible: they merely told us to shut down both nodes and restart. Hello..., this is a production environment! You can't just give the lame "reboot" response like Windows desktop tech support!

Additionally, Oracle support pointed out that we were deploying updates while the OSB proxies were running. I am not sure I get it. Does that mean we need to shut down the server to deploy any updates???

Anyway, I noticed the exceptions indicated some kind of issue with "coherence", so we decided to leave node2 running for over 24 hours, hoping the "coherence" cache might get cleared. Then we tried to start node1 again and, unfortunately, got the same error.

After poking my head all over the place, I had a hunch that node1 (osb_server1) might be keeping its cached values somewhere under <domain home>/servers/osb_server1.

I renamed the directory "osb_server1" to "osb_server1.bak" and re-created an empty "osb_server1" directory. I restarted osb_server1 (from the WebLogic console) and, voila, it started up beautifully. Node1 synched up with node2 without incident!
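The fix amounts to the few commands below, run while osb_server1 is down. The domain home path is hypothetical (for the demo I stage a fake one with mktemp; in real life point it at your actual domain home):

```shell
# Hypothetical domain home; the demo stages a fake one so the commands
# are safe to run anywhere.
DOMAIN_HOME=$(mktemp -d)
mkdir -p "$DOMAIN_HOME/servers/osb_server1/cache"

cd "$DOMAIN_HOME/servers"
mv osb_server1 osb_server1.bak   # keep the stale cache around for forensics
mkdir osb_server1                # WLS repopulates cache/, tmp/, data/ on start
```

Renaming rather than deleting means you can still inspect (or restore) the old cached state if the restart goes sideways.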

Lessons learned:

1. In a production environment, try to deploy OSB updates as individual resources, not the whole project, as a whole-project update has a bigger footprint that may affect many proxies and other resources. If you only changed a couple of resources, update only those, so you minimize the chance of a cluster synching problem.

2. Do not just take tech support's advice blindly. If we had, we would have had to shut down a busy client portal site. Shutting down a production environment should be the last resort, not the first and only advice given to clients.

Additional note (4/29/13):

Recently, I noticed another example of the OSB servers being out of sync. We had added a "report" activity, but it caused some issue with the OSB report DB. The symptom was that OSB kept trying to commit the DB transaction after it failed, creating an indefinite loop that filled up the log file quickly and brought the server down. We quickly removed the report activity from the proxy, activated the change, and bounced both OSB nodes. But it didn't help.

One peculiar behavior was that the error only happened on one OSB node (node2). My theory is that when the original proxy call was made, the request was dispatched to node2 and failed during the DB transaction, so it was stuck only on node2 (attempting to commit the DB transaction).

I applied the same trick of blowing away the osb_server2 directory. No luck :(

Eventually, I found out this deleted report activity (with one unfinished DB transaction) was somehow stuck in the JMS persistent store (go figure!). In our case, the persistent store is .../osb_cluster/jms/FileStore_auto_2. We blew this file away (gee, not sure we should just do that), restarted node2, and that took care of the problem.
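If you do go down this road, a slightly safer variant is to rename the store rather than delete it, since removing a JMS persistent store discards any unprocessed persistent messages it still holds. Sketch only; the store directory is the elided path above, so substitute your own:

```shell
# osb_server2 must be down first. STORE_DIR is a placeholder for the
# elided .../osb_cluster/jms path above -- use your own location.
STORE_DIR="<your osb_cluster/jms directory>"
mv "$STORE_DIR/FileStore_auto_2" "$STORE_DIR/FileStore_auto_2.bak"
# Restart osb_server2; WLS creates a fresh, empty FileStore_auto_2.
```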